DB Optimizer » Why does the same job run slower on an exact clone?

Why does the same job run slower on an exact clone?

November 15th, 2011

A customer was working with a clone copy of their production database for reports, but the cloned database reports were running slower than the same reports on production and they wanted to know why.

Ostensibly the machines were similar and the memory was the same, though it’s always worth verifying. The first thing I checked was the load profile. If you have OEM you can look at the performance page and/or top activity page to see the load profile. The load profile of production was much more CPU bound and much less I/O bound than the clone and their was clearly a difference.

The customer knew the time they the report was run on the clone but was unsure exactly when it was run on production, though knew it had been run in the past day.

On the clone box, during the report run the load profile using SQL script to reproduce the performance graph looked like :

07-NOV  CLONE
TM     AAS  GRAPH
----- ----  ----------------------------------------------------
11:00   .2  +              6
12:00   .2  -              6
13:00   .1                 6
14:00   .1                 6
15:00   .1                 6
16:00 14.9  ++++++ooooooooo6ooooooooooooooooooooooooooooooooooo-
17:00   .0                 6

The reports were run between 4pm and 5pm (ie 16:00-17:00) and that’s easy to see from the load chart.
The “+” represent CPU, “o” represent I/O, and “-” other wait time.
Now the reports on production were suppose to be run around 7am but the client wasn’t sure.
Here is what the load profile looked like on production looked like

07-NOV PRODUCTION
time AAS GRAPH
----- --- ---------------------------------------------------------------------
00:00 1.5 ++++o
01:00 1.1 +++
02:00  .4 +
03:00  .2 +
04:00  .4 ++
05:00 1.5 ++ooo
06:00 1.6 +++oo
07:00 3.2 ++++++++ooo-
08:00 3.6 ++++++++++ooo--
09:00 6.1 +++++++++++++++++oooooo----
10:00 4.7 +++++++++++++++++ooo
11:00 5.3 +++++++++++++++++++ooo-
12:00 0.0 +++++++++++++++++++++++++++++++++oooooooo---------------------------
13:00 0.5 ++++++++++++++++++++++++oooooooooooooooooooo
14:00 19.2++++++++++++++++++++++++++++++++oooooooooooooooooooooooooooooooooooooo
15:00 6.5 +++++++++++++++oooooo

The customer thought the report had been run at 8am on production and at 8am there is little I/O wait and some CPU. Looking at the top SQL from the clone during the report verses the production database at 8am didn’t show any of the same top SQL. At this point I ask the customer to send me AWR exports of the two databases. I imported the AWR exports giving the clone database DBID=1 and production DBID=2

The first thing I check was the I/O latency in the AWR repository data. Both production and the clone showed I/O averaging 6ms, so latency was unlikely to be the problem.
Clone Latency during the report:

BTIME           EVENT_NAME                        AVG_MS         CT
--------------- ------------------------------ --------- ----------
07-NOV-11 16:00 db file scattered read              6.48        4,246
07-NOV-11 16:00 db file sequential read             8.15    4,760,454   **
07-NOV-11 16:00 direct path read                    3.50      680,192
07-NOV-11 16:00 direct path write                  10.22           14
07-NOV-11 16:00 direct path write temp               .62            2

Production looked like (for example)

BTIME           EVENT_NAME                        AVG_MS         CT
--------------- ------------------------------ --------- ----------
07-NOV-11 14:00 db file scattered read              2.93    148,226
07-NOV-11 14:00 db file sequential read             6.35  4,961,761  **
07-NOV-11 14:00 direct path read                    2.32  2,706,322
07-NOV-11 14:00 direct path write                    .63        204
07-NOV-11 14:00 direct path write temp               .86        820

Thus the clone looks slightly slower, but not enough to explain I/O load seen on the clone.

Then I check for the top SQL during the report which gave:

select
     SQL_ID ,
     sum(decode(session_state,'ON CPU',1,0))     "CPU",
     sum(decode(session_state,'WAITING',1,0))    -
     sum(decode(session_state,'WAITING', decode(wait_class, 'User I/O',1,0),0))    "WAIT" ,
     sum(decode(session_state,'WAITING', decode(wait_class, 'User I/O',1,0),0))    "IO" ,
     sum(decode(session_state,'ON CPU',1,1))     "TOTAL"
from   dba_hist_active_sess_history
where SQL_ID is not NULL
 and  dbid=1
group by sql_id
order by sum(decode(session_state,'ON CPU',1,1))   desc

   SQL_ID           CPU   WAIT     IO  TOTAL
   ------------- ------ ------ ------ ------
  4sfx95xxxxc03   31.0     .0  736.0    767

I wasn’t confident that the customer knew when the report was run on production, so I just searched for these SQL_ID values over the last 12 hours. The top query looks like a radically different I/O profile

select
     SQL_ID ,
     sum(decode(session_state,'ON CPU',1,0))     "CPU",
     sum(decode(session_state,'WAITING',1,0))    -
     sum(decode(session_state,'WAITING', decode(wait_class, 'User I/O',1,0),0))    "WAIT" ,
     sum(decode(session_state,'WAITING', decode(wait_class, 'User I/O',1,0),0))    "IO" ,
     sum(decode(session_state,'ON CPU',1,1))     "TOTAL"
from   dba_hist_active_sess_history
where SQL_ID is not NULL
 and  dbid=2
and sql_id = '4sfx95xxxxc03',
group by sql_id
order by sum(decode(session_state,'ON CPU',1,1))   desc
/

   SQL_ID           CPU   WAIT        IO  TOTAL
   ------------- ------ ------ --------- ------
   4sfx95xxxxc03   12.0     .0     39.00     51

Now I wanted verify the stats for each SQL statement to make sure the query was doing the same work:

VDB
   SNAP_ID      EXECS       DISK        LIO        RWS        CPU     ELAPSE    IO_TIME       READS
---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------  ----------
     18798      25648    2746798   21669637      12554 1768040000 7867477859 7430523627     2746594
Prod
     18887     116449      52123    3026366      77023   67260000  377033758  313874605       52122

The query is being executed more on Prod and doing less disk and less CPU.
Now let’s verify it’s execution plan:

      DB SQL_ID        PLAN_HASH_VALUE
--------- ------------- ---------------
    clone 4sfx95xxxxc03      1805550729
     Prod 4sfx95xxxxc03      3038004819

different execution plans
lets look at the two execution plans

(I used a direct query on DBA_HIST_SQL_PLAN but you can also used the package provided by Oracle, for example:

select * from table(dbms_xplan.display_awr('4sfx95xxxxc03',1805550729,1)); -- sql_id, plan_hash, DBID

)

Plan Hash
VDB                                             Production
-----------                                     ------------
1805550729                                      3038004819  

OPERATION
----------------------------------              --------------------
SELECT STATEMENT_                               SELECT STATEMENT_
 SORT_ORDER BY                                    SORT_ORDER BY FILTER_
  FILTER_                                           FILTER_
   NESTED LOOPS_                                    NESTED LOOPS_
    NESTED LOOPS_                                    NESTED LOOPS_
->   TABLE ACCESS_BY INDEX ROWID PAY_CHECK           TABLE ACCESS_BY INDEX ROWID PAY_EARNINGS
->    INDEX_RANGE SCAN I1PAY_CHECK                      INDEX_RANGE SCAN EPAY_EARNINGS
->   INDEX_RANGE SCAN PAY_EARNINGS                   INDEX_UNIQUE SCAN PAY_CHECK
->  TABLE ACCESS_BY INDEX ROWID PAY_EARNINGS        TABLE ACCESS_BY INDEX ROWID PAY_CHECK
   TABLE ACCESS_BY INDEX ROWID PAY_CHECK           TABLE ACCESS_BY INDEX ROWID PAY_CHECK
    INDEX_RANGE SCAN I0PAY_CHECK                      INDEX_RANGE SCAN I0PAY_CHECK

So the execution plans have changes. Now why have they changed? That’s a story for another day, but the plan would only change if the data and/or the statistics on the tables had changed.

REFERENCE: DBA HIST Views

Kyle Hailey
Uncategorized