Why does the same job run slower on an exact clone?

November 15th, 2011

A customer  was working with a clone copy of their production database for reports, but the cloned database reports were running slower than the same reports on production and they wanted to know why.

Ostensibly the machines were similar and the memory was the same, though it’s always worth verifying.  The first thing I checked was the load profile. If you have OEM you can look at the performance page and/or top activity page to see the load profile.  The load profile of production was much more CPU bound and much less I/O bound than the clone and their was clearly a difference.

The customer knew the time they the report was run on the clone  but was unsure exactly when it was run on production, though knew it had been run in the past day.

On the clone box, during the report run the load profile using SQL script to reproduce the performance graph looked like :

07-NOV  CLONE
TM     AAS  GRAPH
----- ----  ----------------------------------------------------
11:00   .2  +              6
12:00   .2  -              6
13:00   .1                 6
14:00   .1                 6
15:00   .1                 6
16:00 14.9  ++++++ooooooooo6ooooooooooooooooooooooooooooooooooo-
17:00   .0                 6

The reports were run between 4pm and 5pm  (ie 16:00-17:00) and that’s easy to see from the load chart.
The “+” represent CPU, “o” represent I/O, and “-” other wait time.
Now the reports on production were suppose to be run around 7am but the client wasn’t sure.
Here is what the load profile looked like on production looked like

07-NOV PRODUCTION
time AAS GRAPH
----- --- ---------------------------------------------------------------------
00:00 1.5 ++++o
01:00 1.1 +++
02:00  .4 +
03:00  .2 +
04:00  .4 ++
05:00 1.5 ++ooo
06:00 1.6 +++oo
07:00 3.2 ++++++++ooo-
08:00 3.6 ++++++++++ooo--
09:00 6.1 +++++++++++++++++oooooo----
10:00 4.7 +++++++++++++++++ooo
11:00 5.3 +++++++++++++++++++ooo-
12:00 0.0 +++++++++++++++++++++++++++++++++oooooooo---------------------------
13:00 0.5 ++++++++++++++++++++++++oooooooooooooooooooo
14:00 19.2++++++++++++++++++++++++++++++++oooooooooooooooooooooooooooooooooooooo
15:00 6.5 +++++++++++++++oooooo

The customer thought the report had been run at 8am on production and at 8am there is little I/O wait and some  CPU. Looking at the top SQL from the clone during the report verses the production database at 8am didn’t show any of the same top SQL.  At this point I ask the customer to send me AWR exports of the two databases. I imported the AWR exports giving the clone database DBID=1 and production DBID=2

The first thing I check was the I/O latency in the AWR repository data. Both production and the clone showed I/O averaging 6ms, so latency was unlikely to be the problem.
Clone Latency  during the report:

BTIME           EVENT_NAME                        AVG_MS         CT
--------------- ------------------------------ --------- ----------
07-NOV-11 16:00 db file scattered read              6.48        4,246
07-NOV-11 16:00 db file sequential read             8.15    4,760,454   **
07-NOV-11 16:00 direct path read                    3.50      680,192
07-NOV-11 16:00 direct path write                  10.22           14
07-NOV-11 16:00 direct path write temp               .62            2

Production looked like (for example)

BTIME           EVENT_NAME                        AVG_MS         CT
--------------- ------------------------------ --------- ----------
07-NOV-11 14:00 db file scattered read              2.93    148,226
07-NOV-11 14:00 db file sequential read             6.35  4,961,761  **
07-NOV-11 14:00 direct path read                    2.32  2,706,322
07-NOV-11 14:00 direct path write                    .63        204
07-NOV-11 14:00 direct path write temp               .86        820

Thus the clone looks slightly slower,  but not enough to explain I/O load seen on the clone.

Then I check for the top SQL during the report which gave:

select
     SQL_ID ,
     sum(decode(session_state,'ON CPU',1,0))     "CPU",
     sum(decode(session_state,'WAITING',1,0))    -
     sum(decode(session_state,'WAITING', decode(wait_class, 'User I/O',1,0),0))    "WAIT" ,
     sum(decode(session_state,'WAITING', decode(wait_class, 'User I/O',1,0),0))    "IO" ,
     sum(decode(session_state,'ON CPU',1,1))     "TOTAL"
from   dba_hist_active_sess_history
where SQL_ID is not NULL
 and  dbid=1
group by sql_id
order by sum(decode(session_state,'ON CPU',1,1))   desc

 

   SQL_ID           CPU   WAIT     IO  TOTAL
   ------------- ------ ------ ------ ------
  4sfx95xxxxc03   31.0     .0  736.0    767

I wasn’t confident that the customer knew when the report was run on production, so I just searched for these SQL_ID values over the last 12 hours. The top query looks like a radically different I/O profile

select
     SQL_ID ,
     sum(decode(session_state,'ON CPU',1,0))     "CPU",
     sum(decode(session_state,'WAITING',1,0))    -
     sum(decode(session_state,'WAITING', decode(wait_class, 'User I/O',1,0),0))    "WAIT" ,
     sum(decode(session_state,'WAITING', decode(wait_class, 'User I/O',1,0),0))    "IO" ,
     sum(decode(session_state,'ON CPU',1,1))     "TOTAL"
from   dba_hist_active_sess_history
where SQL_ID is not NULL
 and  dbid=2
and sql_id = '4sfx95xxxxc03',
group by sql_id
order by sum(decode(session_state,'ON CPU',1,1))   desc
/

 

   SQL_ID           CPU   WAIT        IO  TOTAL
   ------------- ------ ------ --------- ------
   4sfx95xxxxc03   12.0     .0     39.00     51

Now I wanted verify the stats for each SQL statement to make sure the query was doing the same work:

VDB
   SNAP_ID      EXECS       DISK        LIO        RWS        CPU     ELAPSE    IO_TIME       READS
---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------  ----------
     18798      25648    2746798   21669637      12554 1768040000 7867477859 7430523627     2746594
Prod
     18887     116449      52123    3026366      77023   67260000  377033758  313874605       52122

The query is being executed more on Prod and doing less disk and less CPU.
Now let’s verify it’s execution plan:

      DB SQL_ID        PLAN_HASH_VALUE
--------- ------------- ---------------
    clone 4sfx95xxxxc03      1805550729
     Prod 4sfx95xxxxc03      3038004819

different execution plans
lets look at the two execution plans

(I used a direct query on DBA_HIST_SQL_PLAN but you can also used the package provided by Oracle, for example:

select * from table(dbms_xplan.display_awr('4sfx95xxxxc03',1805550729,1)); -- sql_id, plan_hash, DBID

)

Plan Hash
VDB                                             Production
-----------                                     ------------
1805550729                                      3038004819  

OPERATION
----------------------------------              --------------------
SELECT STATEMENT_                               SELECT STATEMENT_
 SORT_ORDER BY                                    SORT_ORDER BY FILTER_
  FILTER_                                           FILTER_
   NESTED LOOPS_                                    NESTED LOOPS_
    NESTED LOOPS_                                    NESTED LOOPS_
->   TABLE ACCESS_BY INDEX ROWID PAY_CHECK           TABLE ACCESS_BY INDEX ROWID PAY_EARNINGS
->    INDEX_RANGE SCAN I1PAY_CHECK                      INDEX_RANGE SCAN EPAY_EARNINGS
->   INDEX_RANGE SCAN PAY_EARNINGS                   INDEX_UNIQUE SCAN PAY_CHECK
->  TABLE ACCESS_BY INDEX ROWID PAY_EARNINGS        TABLE ACCESS_BY INDEX ROWID PAY_CHECK
   TABLE ACCESS_BY INDEX ROWID PAY_CHECK           TABLE ACCESS_BY INDEX ROWID PAY_CHECK
    INDEX_RANGE SCAN I0PAY_CHECK                      INDEX_RANGE SCAN I0PAY_CHECK

So the execution plans have changes. Now why have they changed? That’s a story for another day, but the plan would only change if the data and/or the statistics on the tables had changed.

REFERENCE: DBA HIST Views


Uncategorized

  1. Trackbacks

  2. No trackbacks yet.
  1. Comments

  2. Vishal Desai
    November 15th, 2011 at 19:37 | #1

    Hi Kyle,

    What method was used to clone production database ?

    Thanks,
    Vishal

  3. November 16th, 2011 at 01:12 | #2

    The database was cloned using Delphix

  4. Tom Robbins
    December 19th, 2012 at 15:25 | #3

    Kyle,
    We are using Delphix and it is great for creating a fresh test environment as a clone of production.

    One thing we are struggling with, though, is a large INSERT statement of a 5GB table:
    even though we have PARALLEL degree set to the default for the database, and we have enabled parallel DML, the INSERT still runs single-threaded, and therefore takes a LONG time (72 minutes).

    Is there something in Delphix that causes this single-threadedness?

    Thanks,
    Tom

  5. December 19th, 2012 at 18:46 | #4

    @Tom Robbins: There is nothing inherent in a virtual database that would make parallel go to single threaded. Have you compared the init.ora/spfile parameters from the two databases? By default, Delphix will keep most of the init.ora parameters the same as the source but there are a few that can be removed or changed for various reasons. Also if you have created an init.ora template for your virtual databases then of course that could explain differences between the source database and virtual database.
    The first things to look at
    1. init.ora/spfile on source and virtual
    2. table parallel settings on source and virtual
    3. of course compare the insert code itself on source and virtual
    Feel free to email me at kyle.hailey@delphix.com

  6. Tom Robbins
    January 8th, 2013 at 02:29 | #5

    @Kyle Hailey
    Thanks. for some reason, I had not looked at the dbwr parallel settings in the pfile/spfile….

  7. jsm
    January 18th, 2013 at 16:06 | #6

    Take a look at Carlos Sierra’s SQLT tool downloadable from support. It has an “extract” method and a “compare” method. Compare will tell you EXACTLY what is different between the two extracts down to the nitty gritty of the optimizer, the plan, and the details of all the supporting statistics.

  8. January 18th, 2013 at 18:35 | #7

    @jsm: Carlos’ tools are awesome. Good timing on your post because our NoCOUG board is hosting Carlos for a 2 day sql tuning seminar with his tools!
    https://nocoug.wildapricot.org/Events?eventId=605183
    Feb 19-20, 2013
    CarrAmerica Conference Center, Pleasanton, CA
    8 Spaces left as of today


+ 9 = seventeen