Why does the same job run slower on an exact clone?
A customer was working with a clone of their production database for reporting, but the reports were running slower on the cloned database than on production, and they wanted to know why.
Ostensibly the machines were similar and the memory was the same, though that’s always worth verifying. The first thing I checked was the load profile. If you have OEM you can look at the performance page and/or the top activity page to see the load profile. The load profile of production was much more CPU bound and much less I/O bound than the clone, so there was clearly a difference.
The customer knew the time the report was run on the clone but was unsure exactly when it was run on production, though they knew it had been run in the past day.
On the clone box, the load profile during the report run, generated with a SQL script that reproduces the performance graph, looked like this:
07-NOV CLONE
TM     AAS   GRAPH
-----  ----  ----------------------------------------------------
11:00    .2  +              6
12:00    .2  -              6
13:00    .1                 6
14:00    .1                 6
15:00    .1                 6
16:00  14.9  ++++++ooooooooo6ooooooooooooooooooooooooooooooooooo-
17:00    .0                 6
The reports were run between 4pm and 5pm (i.e. 16:00-17:00), and that’s easy to see from the load chart.
In the graph, “+” represents CPU, “o” represents I/O, and “-” represents other wait time.
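For reference, the graph comes from querying the ASH samples stored in AWR. This is a simplified sketch of that kind of query, not the exact script used above; AWR keeps roughly one ASH sample every 10 seconds, hence the *10/3600 scaling to get average active sessions per hour:

-- hourly average active sessions (AAS) split into CPU, user I/O and other wait
select to_char(trunc(sample_time,'HH24'),'DD-MON HH24:MI') tm,
       round(count(*)*10/3600,1)                                    aas,
       round(sum(decode(session_state,'ON CPU',1,0))*10/3600,1)     cpu,
       round(sum(decode(session_state,'WAITING',
                 decode(wait_class,'User I/O',1,0),0))*10/3600,1)   io,
       round(sum(decode(session_state,'WAITING',
                 decode(wait_class,'User I/O',0,1),0))*10/3600,1)   other
  from dba_hist_active_sess_history
 group by trunc(sample_time,'HH24')
 order by trunc(sample_time,'HH24');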
Now the reports on production were supposed to have been run around 7am, but the client wasn’t sure.
Here is what the load profile looked like on production:
07-NOV PRODUCTION
time   AAS   GRAPH
-----  ----  ---------------------------------------------------------------------
00:00   1.5  ++++o
01:00   1.1  +++
02:00    .4  +
03:00    .2  +
04:00    .4  ++
05:00   1.5  ++ooo
06:00   1.6  +++oo
07:00   3.2  ++++++++ooo-
08:00   3.6  ++++++++++ooo--
09:00   6.1  +++++++++++++++++oooooo----
10:00   4.7  +++++++++++++++++ooo
11:00   5.3  +++++++++++++++++++ooo-
12:00   0.0  +++++++++++++++++++++++++++++++++oooooooo---------------------------
13:00   0.5  ++++++++++++++++++++++++oooooooooooooooooooo
14:00  19.2  ++++++++++++++++++++++++++++++++oooooooooooooooooooooooooooooooooooooo
15:00   6.5  +++++++++++++++oooooo
The customer thought the report had been run at 8am on production, but at 8am there is little I/O wait and only modest CPU. Looking at the top SQL from the clone during the report versus the production database at 8am didn’t show any of the same top SQL. At this point I asked the customer to send me AWR exports of the two databases. I imported the AWR exports, giving the clone database DBID=1 and production DBID=2.
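As a quick sanity check after the import, a query like this sketch lists which databases ended up in the local repository (DBA_HIST_DATABASE_INSTANCE has one row per database, instance and startup):

-- list the DBIDs and database names now present in the AWR repository
select distinct dbid, db_name, host_name
  from dba_hist_database_instance;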
The first thing I checked was the I/O latency in the AWR repository data. Both production and the clone showed I/O averaging around 6ms, so latency was unlikely to be the problem.
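The averages come from DBA_HIST_SYSTEM_EVENT; its counters are cumulative, so the latency for an interval is the delta of time waited over the delta of waits between consecutive snapshots. A rough single-instance sketch (dbid 1 being the imported clone data):

-- average user I/O latency per wait event per AWR snapshot interval
select btime, event_name,
       round((ms_end - ms_beg) / nullif(ct_end - ct_beg, 0), 2) avg_ms,
       ct_end - ct_beg ct
  from (select s.begin_interval_time btime,
               e.event_name,
               e.total_waits                 ct_end,
               lag(e.total_waits) over
                  (partition by e.event_name order by e.snap_id) ct_beg,
               e.time_waited_micro/1000      ms_end,
               lag(e.time_waited_micro/1000) over
                  (partition by e.event_name order by e.snap_id) ms_beg
          from dba_hist_system_event e,
               dba_hist_snapshot     s
         where e.snap_id         = s.snap_id
           and e.dbid            = s.dbid
           and e.instance_number = s.instance_number
           and e.dbid            = 1            -- 1 = imported clone data
           and e.wait_class      = 'User I/O')
 order by btime, event_name;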
Clone Latency during the report:
BTIME            EVENT_NAME                        AVG_MS          CT
---------------  ------------------------------  --------  ----------
07-NOV-11 16:00  db file scattered read               6.48       4,246
07-NOV-11 16:00  db file sequential read              8.15   4,760,454 **
07-NOV-11 16:00  direct path read                     3.50     680,192
07-NOV-11 16:00  direct path write                   10.22          14
07-NOV-11 16:00  direct path write temp                .62           2
Production looked like this (for example):
BTIME            EVENT_NAME                        AVG_MS          CT
---------------  ------------------------------  --------  ----------
07-NOV-11 14:00  db file scattered read               2.93     148,226
07-NOV-11 14:00  db file sequential read              6.35   4,961,761 **
07-NOV-11 14:00  direct path read                     2.32   2,706,322
07-NOV-11 14:00  direct path write                     .63         204
07-NOV-11 14:00  direct path write temp                .86         820
Thus the clone looks slightly slower, but not slow enough to explain the I/O load seen on the clone.
Then I checked the top SQL during the report, which gave:
select SQL_ID,
       sum(decode(session_state,'ON CPU',1,0)) "CPU",
       sum(decode(session_state,'WAITING',1,0)) -
         sum(decode(session_state,'WAITING', decode(wait_class,'User I/O',1,0),0)) "WAIT",
       sum(decode(session_state,'WAITING', decode(wait_class,'User I/O',1,0),0)) "IO",
       sum(decode(session_state,'ON CPU',1,1)) "TOTAL"
  from dba_hist_active_sess_history
 where SQL_ID is not NULL
   and dbid=1
 group by sql_id
 order by sum(decode(session_state,'ON CPU',1,1)) desc
SQL_ID            CPU   WAIT     IO  TOTAL
------------- ------ ------ ------ ------
4sfx95xxxxc03   31.0     .0  736.0    767
I wasn’t confident that the customer knew when the report was run on production, so I just searched for this SQL_ID over the last 12 hours. On production, the same top query shows a radically different I/O profile:
select SQL_ID,
       sum(decode(session_state,'ON CPU',1,0)) "CPU",
       sum(decode(session_state,'WAITING',1,0)) -
         sum(decode(session_state,'WAITING', decode(wait_class,'User I/O',1,0),0)) "WAIT",
       sum(decode(session_state,'WAITING', decode(wait_class,'User I/O',1,0),0)) "IO",
       sum(decode(session_state,'ON CPU',1,1)) "TOTAL"
  from dba_hist_active_sess_history
 where SQL_ID is not NULL
   and dbid=2
   and sql_id = '4sfx95xxxxc03'
 group by sql_id
 order by sum(decode(session_state,'ON CPU',1,1)) desc
/
SQL_ID            CPU   WAIT        IO  TOTAL
------------- ------ ------ --------- ------
4sfx95xxxxc03   12.0     .0     39.00     51
Now I wanted to verify the statistics for each SQL statement to make sure the query was doing the same work:
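The numbers below come out of DBA_HIST_SQLSTAT; a sketch of that kind of query against the imported repository (dbid 1 = clone, dbid 2 = production):

-- per-snapshot work done by the statement on each database
select dbid, snap_id,
       executions_delta             execs,
       disk_reads_delta             disk,
       buffer_gets_delta            lio,
       rows_processed_delta         rws,
       cpu_time_delta               cpu,
       elapsed_time_delta           elapse,
       iowait_delta                 io_time,
       physical_read_requests_delta reads
  from dba_hist_sqlstat
 where sql_id = '4sfx95xxxxc03'
 order by dbid, snap_id;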
DB     SNAP_ID   EXECS    DISK      LIO        RWS    CPU         ELAPSE      IO_TIME     READS
-----  --------  -------  --------  ---------  -----  ----------  ----------  ----------  --------
VDB    18798     25648    2746798   21669637   12554  1768040000  7867477859  7430523627  2746594
Prod   18887     116449   52123     3026366    77023  67260000    377033758   313874605   52122
The query is executed more often on Prod, yet it does far less disk I/O and uses far less CPU.
Now let’s verify its execution plan:
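A quick way to see which plan each database used is to pull the plan hash value from DBA_HIST_SQLSTAT (a sketch; dbid 1 = clone, 2 = production):

-- plan hash value(s) recorded for the statement on each database
select dbid, sql_id, plan_hash_value
  from dba_hist_sqlstat
 where sql_id = '4sfx95xxxxc03'
 group by dbid, sql_id, plan_hash_value;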
DB        SQL_ID         PLAN_HASH_VALUE
--------- -------------  ---------------
clone     4sfx95xxxxc03       1805550729
Prod      4sfx95xxxxc03       3038004819
The execution plans are different, so let’s look at the two plans.
(I used a direct query on DBA_HIST_SQL_PLAN, but you can also use the package provided by Oracle, for example:
select * from table(dbms_xplan.display_awr('4sfx95xxxxc03',1805550729,1)); -- sql_id, plan_hash, DBID
)
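A sketch of the kind of direct query against DBA_HIST_SQL_PLAN that pulls both plans, indented by depth, out of the imported repository:

-- both execution plans for the statement, one after the other
select dbid, plan_hash_value, id,
       lpad(' ', depth) || operation || ' ' || options operation,
       object_name
  from dba_hist_sql_plan
 where sql_id = '4sfx95xxxxc03'
 order by dbid, plan_hash_value, id;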
Plan Hash   VDB                                         Production
            1805550729                                  3038004819

OPERATION
----------  ------------------------------------------  ------------------------------------------
            SELECT STATEMENT_                           SELECT STATEMENT_
            SORT_ORDER BY                               SORT_ORDER BY
            FILTER_                                     FILTER_
            FILTER_                                     NESTED LOOPS_
            NESTED LOOPS_                               NESTED LOOPS_
            NESTED LOOPS_                            -> TABLE ACCESS_BY INDEX ROWID PAY_CHECK
            TABLE ACCESS_BY INDEX ROWID PAY_EARNINGS -> INDEX_RANGE SCAN I1PAY_CHECK
            INDEX_RANGE SCAN EPAY_EARNINGS           -> INDEX_RANGE SCAN PAY_EARNINGS
            INDEX_UNIQUE SCAN PAY_CHECK              -> TABLE ACCESS_BY INDEX ROWID PAY_EARNINGS
            TABLE ACCESS_BY INDEX ROWID PAY_CHECK       TABLE ACCESS_BY INDEX ROWID PAY_CHECK
            TABLE ACCESS_BY INDEX ROWID PAY_CHECK       INDEX_RANGE SCAN I0PAY_CHECK
            INDEX_RANGE SCAN I0PAY_CHECK
So the execution plans have changed. Now, why have they changed? That’s a story for another day, but the plan would only change if the data and/or the statistics on the tables had changed.
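A quick first check on the statistics side, sketched here, is to compare when the tables involved were last analyzed on each database (add an owner filter as appropriate):

-- when were optimizer statistics last gathered on the tables in the plan?
select owner, table_name, num_rows, last_analyzed
  from dba_tab_statistics
 where table_name in ('PAY_CHECK', 'PAY_EARNINGS');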
REFERENCE: DBA HIST Views
Comments
Hi Kyle,
What method was used to clone the production database?
Thanks,
Vishal
The database was cloned using Delphix.
Kyle,
We are using Delphix and it is great for creating a fresh test environment as a clone of production.
One thing we are struggling with, though, is a large INSERT into a 5GB table: even though the PARALLEL degree is set to the default for the database and we have enabled parallel DML, the INSERT still runs single-threaded, and therefore takes a LONG time (72 minutes).
Is there something in Delphix that causes this single-threadedness?
Thanks,
Tom
@Tom Robbins: There is nothing inherent in a virtual database that would make parallel execution go single-threaded. Have you compared the init.ora/spfile parameters of the two databases? By default, Delphix keeps most of the init.ora parameters the same as the source, but there are a few that can be removed or changed for various reasons. Also, if you have created an init.ora template for your virtual databases, then of course that could explain differences between the source database and the virtual database.
The first things to look at (a quick sketch for 1 and 2 follows the list):
1. init.ora/spfile on source and virtual
2. table parallel settings on source and virtual
3. of course compare the insert code itself on source and virtual
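For items 1 and 2, something like this sketch, run on both source and virtual and then diffed (the table name is hypothetical):

-- parallel-related instance parameters
select name, value
  from v$parameter
 where name like 'parallel%';

-- parallel degree declared on the target table
select owner, table_name, degree
  from dba_tables
 where table_name = 'BIG_TABLE';   -- hypothetical: the INSERT's target table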
Feel free to email me at
@Kyle Hailey
Thanks. For some reason, I had not looked at the dbwr parallel settings in the pfile/spfile…
Take a look at Carlos Sierra’s SQLT tool, downloadable from Oracle Support. It has an “extract” method and a “compare” method. Compare will tell you EXACTLY what is different between the two extracts, down to the nitty-gritty of the optimizer, the plan, and the details of all the supporting statistics.
@jsm: Carlos’ tools are awesome. Good timing on your post, because our NoCOUG board is hosting Carlos for a 2-day SQL tuning seminar with his tools!
https://nocoug.wildapricot.org/Events?eventId=605183
Feb 19-20, 2013
CarrAmerica Conference Center, Pleasanton, CA
8 Spaces left as of today