Live webcast: Jonathan Lewis – An Oracle Expert’s Look at the Delphix Technology
Jonathan Lewis came out to the Delphix offices in Menlo Park, CA to kick the tires on our product for a week.
Join Jonathan Lewis and me for a discussion and sharing of first impressions of Delphix. Jonathan worked closely with the Delphix team, investigating performance, comparing technologies, and testing community-requested use cases. In this chat, Jonathan and I will informally discuss the preliminary results of his first-hand experience with Delphix and what he learned about its functional aims, implementation strategies, and underlying technologies. As part of his hands-on activity, Jonathan started with the simple provisioning benefits of Delphix, then went on to look at the performance implications of various scenarios, including test cases suggested by the Oracle community. For a glimpse into what will be discussed, see Jonathan's blog post on his visit to Delphix.
Requires tshark (formerly tethereal), which comes with the wireshark install on Linux:
yum install wireshark
On the NFS server, the latency of 8K block sends was measured with DTrace and showed:
On the NFS client, the Oracle database was measured with oramon.sh, which shows the 8k read latency as
There is a substantial difference between the latency seen by NFS server and the latency seen by Oracle. Where is the latency coming from? Is it a problem on the NFS server, the NFS client or the network? How do we find out? One strategy is to collect TCP dumps on the NFS client and NFS server and compare them.
Collecting TCP traces
First, set up a situation where a load is running on the database and there is a discrepancy between the latency on the NFS server and the latency seen by Oracle. Collect TCP traces during this period of discrepancy. The traces will be large, so run them for only 20 seconds or so to limit their size.
Collecting TCP dumps
TCP dumps can be collected on Linux with tcpdump and on Solaris with snoop:
-w : output file, give “.cap” extension to be recognized automatically by wireshark
host : IP of NFS server
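Putting those options together, the capture commands look something like the following sketch; the server IP (10.0.0.5) and the capture file names are placeholders to replace with your own:

```shell
# On the Linux NFS client (run as root), capture traffic to/from the NFS server:
tcpdump -w nfs_client.cap host 10.0.0.5

# On the Solaris NFS server, snoop is the equivalent:
snoop -o nfs_server.cap 10.0.0.5
```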
Analyzing the Traces
The trace files can be analyzed on a Linux machine with wireshark installed by running the script parsetcp.pl on the two traces.
$ ./parsetcp.pl nfs_server.cap nfs_client.cap
==================== RAW DATA ============
Parsing NFS server trace: nfs_server.cap
type avg ms count
READ : 18.10, 12493
WRITE : 30.08, 4918
Parsing NFS client trace: nfs_client.cap
type avg ms count
READ : 18.74, 25974
WRITE : 30.05, 10197
==================== MATCHED DATA ============
type avg ms
nfs server: 18.12
diff : 1.14
Processed 14010 packets (Matched: 11964 Missed: 2046)
type avg ms
server : 30.86
client : 31.83
diff : 0.96
Processed 5806 packets (Matched: 4391 Missed: 1415)
The script will find the average latency for reads and writes on both ends and then try to match the packets found both at the client and server and report the average latency for reads and writes on the packets that match.
In the above output, the read time seen on the NFS client's receiving end is on par with what is seen on the NFS server.
Thus, if the receive time on the NFS client host is on par with the NFS server, yet the latency seen by Oracle is much greater than the receive time, then there is likely some issue on the NFS client host machine itself, such as heavy memory paging, that causes the latency seen by Oracle to be much higher than that seen in the TCP traces.
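The matching idea behind parsetcp.pl can be sketched in miniature. In this toy example (all data and file names are made up), records from the two traces are paired by a shared key, a stand-in for the RPC XID; unmatched packets are dropped and per-side averages are computed over the matched set only:

```shell
# Two tiny fake trace summaries: "xid latency_ms" per line.
cat > server_lat.txt <<'EOF'
a1 18.2
a2 18.0
EOF
cat > client_lat.txt <<'EOF'
a1 19.1
a2 18.9
a3 20.0
EOF
# 'join' keeps only XIDs present in both files (a3 has no server match):
join server_lat.txt client_lat.txt |
  awk '{ s += $2; c += $3; n++ }
       END { printf "matched=%d server=%.1f client=%.1f diff=%.1f\n",
                    n, s/n, c/n, (c-s)/n }'
rm -f server_lat.txt client_lat.txt
```

A small client-minus-server difference, as here, points at the network being healthy; a large one points at the wire or the stack between the two capture points.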
Example problem analysis
In the following two examples a Swingbench load was run on the Oracle database on the NFS client. The database files were mounted via NFS from the NFS server. In one case the NFS client and database host was OpenSolaris, and in the other case it was Linux. In both cases the Oracle database reported much higher latency than the NFS server threads.
Test 1) OpenSolaris is the NFS client (and OpenSolaris is the NFS server as well)
NFS server latency (from DTrace of NFS start/done)
What do the TCP traces say? Tracing TCP on both the NFS server side and the NFS client side indicates that the slow latencies are coming from the NFS server.
==================== Individual HOST DATA ============
Parsing nfs server trace: nfs_server.cap
type avg ms count
READ : 44.60, 7731
Parsing client trace: client.cap
type avg ms count
READ : 46.54, 15282
==================== MATCHED DATA ============
type avg ms
server : 48.39
client : 49.42
diff : 1.03
Processed 9647 packets (Matched: 5624 Missed: 4023)
TEST 1 summary:
The NFS server says it is responding with 1 ms latency (from DTrace), but the TCP traces show latency more on the order of 48 ms, which is in line with what Oracle sees on the NFS client. Thus the NFS server must somehow be introducing the large latency, not the client.
Test 2) Linux is the NFS client
Again we see the same pattern as in the first test case: the NFS server says its latency is low, while the Oracle database on the NFS client says the latency is high.
TCP trace analysis
==================== Individual HOST DATA ============
Parsing NFS server trace: nfs_server.cap
type avg ms count
READ : 1.17, 9042
Parsing client trace: client.cap
type avg ms count
READ : 1.49, 21984
==================== MATCHED DATA ============
type avg ms
server : 1.03
client : 1.49
diff : 0.46
Processed 13815 packets (Matched: 8169 Missed: 5646)
TEST 2 summary:
In this case the NFS server TCP traces show fast latency, as do the NFS client TCP traces; thus the long latency times must be introduced somewhere on the client machine, not on the NFS server.
Summary of test 1 and test 2
Tests are run with swingbench benchmark.
The Oracle databases are set with the parameter "filesystemio_options=SETALL", which enables direct I/O, so that all I/O requests from Oracle should become NFS requests to the NFS server.
Roughly, the latency statistics for 8k reads originating from Oracle were gathered for both cases (Oracle on Linux and Oracle on Solaris) from three latency data sources:
TCP trace on the NFS server
TCP trace on the NFS client (tcpdump on Linux and snoop on OpenSolaris)
Oracle's "db file sequential read" wait (which is basically a timing of "pread" for 8k random reads specifically)
In the above examples there was nothing outstanding in the vmstat results on the NFS client and the NFS server.
In both cases the problem arose at high levels of Swingbench sessions. The latency disparity appeared above a user load of 16 and increased as the user load increased.
It turned out that in the case where the NFS server was introducing the disparity, the maximum number of NFS threads on the server was 16. The solution was to increase the NFS server thread maximum: sharectl set -p servers=512 nfs.
In the case where the Linux client introduced the disparity, it turned out that the maximum number of outstanding NFS requests was 16. The solution was to raise sunrpc.tcp_slot_table_entries to 128.
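For reference, the two fixes looked roughly like this (run as root; exact syntax varies by OS release, so treat this as a sketch):

```shell
# On the Solaris NFS server: raise the NFS daemon thread maximum.
sharectl set -p servers=512 nfs

# On the Linux NFS client: raise the RPC slot table so more than 16 NFS
# requests can be outstanding at once (takes effect for new mounts):
sysctl -w sunrpc.tcp_slot_table_entries=128
# To persist, add "sunrpc.tcp_slot_table_entries = 128" to /etc/sysctl.conf.
```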
The parsetcp.pl script didn't find the answers by itself, but it did identify which side of the connection was at fault, and it would also have identified the network itself had that been the problem.
One thing to keep in mind is that snoop and tcpdump can be resource intensive and may themselves introduce significant latency. In the above cases, runs with and without TCP tracing enabled showed no more than 1 ms of difference, while the latency being investigated was an order of magnitude greater. In other cases, where there is more I/O between the NFS server and client and/or the transfer sizes are larger (like 1M), the impact of TCP tracing has eclipsed the latency disparity being investigated, and TCP tracing would not have been an appropriate investigation path.
Given a description of gas dynamics and the atmosphere, you would be hard pressed to forecast tornadoes. The term emergence denotes the phenomenon of surprising behaviors arising in complex systems. Modern storage systems are complex and full of emergent behaviors that make forecasting application I/O performance fiendishly difficult.
In collaboration with Kyle Hailey, Adam Leventhal, and others I’ve learned some rules of thumb for how to make accurate I/O performance forecasts. I’ve also stepped on every rake imaginable along the way. For those of you who also may wish to forecast the storage performance an application will receive, this post summarizes some lessons learned.
When I began evaluating storage performance, I had a naive idea that the process would be like this:
I quickly discovered that the actual process was much more like this:
In going through this process, I’ve compiled a bestiary of performance anomalies you may expect to encounter if you interact with a variety of storage platforms, along with root causes for those anomalies and some explanatory notes. Following that are some approaches for designing I/O simulations or tests that avoid them.
Bestiary of I/O Performance Anomalies
Impossibly good performance:
Higher throughput than the connection to storage could provide
Latencies which imply faster than light travel over cables to the storage
Often the operating system and the underlying storage array will have large memory caches. Drivers will also tend to cache small amounts of data. This mostly occurs during read tests but, depending on the application's write semantics, can also occur during write tests.
It is common in storage systems to allocate LUNs or file systems from storage pools that are composed of large numbers of physical drives shared with other LUNs.
Shared connection to storage
Inconsistent performance, especially for:
High throughput tests
NAS storage with a 1 Gb Ethernet connection
For storage tests being done within a VM, other VMs on the same physical server can contend with your tests for access to the storage.
I/O request consolidation
Somewhat paradoxically, both higher latency and higher throughput than expected. Particularly common for small sequential non-O_[D]SYNC writes.
Various I/O layers can group together multiple I/Os issued by your application before issuing them to the storage or a lower layer.
I/O request fragmentation
Higher latency and lower throughput than expected, particularly for large I/Os or NFS based NAS storage
Large application I/O requests can be broken down into multiple, smaller I/Os that are issued serially by intervening layers.
Improbably good sequential read performance
Unexpectedly poor random I/O performance
Performance that changes dramatically midway through a test
Many layers may decide to Read Ahead – that is to optimistically fetch data adjacent to the requested data in case it is needed. If you have a sequential read workload, read ahead will substantially improve performance. If you have a random read workload, read ahead ensures the storage subsystem components will be doing a lot of unnecessary work that may degrade performance. Finally, some systems will try to discern the random or sequential nature of your workload and dynamically enable / disable read ahead. This can lead to inconsistent behavior, for example a sequential read test may start slowly and then speed up once read ahead kicks in.
Tiered storage migration
Unexpectedly bad performance, especially during initial tests on high powered SANs such as EMC VMAX
Some storage systems cleverly use a mix of very high performance flash drives, fast hard disks, and slower large capacity hard disks. These systems dynamically move data among these tiers depending on their access patterns. Often data newly created for a test will be initially located on the slow high capacity disks – I have seen 8 kB random read latencies averaging around 20 ms, with spikes to around 100 ms, for initial tests on these kinds of ‘high performance’ ‘Enterprise’ storage systems.
First write penalty
Unexpectedly bad write performance, especially if it happens early in testing and is not reproducible
Many storage systems, volume managers, and some file systems will use some form of thin provisioning. In these systems when an initial write happens into some region, additional overhead is required, such as adjusting some meta-data and formatting the region. Subsequent writes to the same region will be faster. For example, a thin provisioned VMDK on VMware must be zeroed on first write – so a 1 kB application write can trigger a write of an entire VMFS block of 1 megabyte or more.
Unexpectedly good read performance on raw devices or regions that have not been written
Some file systems and storage systems know whether a region has been written to. Attempts to read from uninitialized regions can result in an immediate software provided response of: “Here you go, all zeros!” – without actually engaging the disk hardware at all. Both VMFS and ZFS will do this, depending on configuration.
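A quick way to see the same effect at the file-system level is a sparse file, which acts as a stand-in for an unwritten region (the file name here is arbitrary):

```shell
# A sparse file has logical size but (almost) no allocated blocks --
# reads of the "unwritten" region return zeros without any disk payload,
# analogous to what VMFS and ZFS can do at the storage layer.
truncate -s 1M sparse.dat   # 1 MB logical size, nothing written
du -k sparse.dat            # shows the (near zero) allocated space
cmp -n 1048576 sparse.dat /dev/zero && echo "reads back as all zeros"
rm -f sparse.dat
```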
Unexpected, or even impossibly good write or read performance
Some file systems will compress data. If your I/O test is writing out a pattern that compresses well (such as all 0s or all 1s), the amount of I/O submitted to and read from the physical storage will be a tiny fraction of your test’s intended I/O workload.
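A quick sanity check of this effect, with gzip standing in for the storage-level compressor:

```shell
# A zero-filled "test file" measures the compressor, not the disks:
# 1 MiB of zeros compresses down to roughly a kilobyte.
head -c 1048576 /dev/zero | gzip -c | wc -c
```

If your test data compresses like this, a compressing file system will do almost no physical I/O on your behalf.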
Unexpectedly poor performance
Often when I speak to a storage administrator after getting unacceptable performance results, I learn there was some kind of maintenance happening at the time, such as migration of data to another storage pool, rebuilding of RAID configurations, etc.
Avoiding Anomalies While Testing
Here is a summary of how to avoid these anomalies, with details below:
Use a real workload if possible
When simulating, be sure to simulate the actual application workload
Evaluate latencies using histograms, not averages
Verify your performance tests give reproducible results
Run tests at the same time the production application will run, and for sufficiently long durations
Ensure the test data is similar to what the application will use and produce
Use a real workload if possible. Unfortunately, often this isn't possible. For example, you probably won't be able to determine the exact workload of the month end close for the new ERP system while that system is being architected, which is when you'll need to design and select the storage.
When you must simulate, be sure to simulate what the application actually does in terms of I/O. This means understanding the read and write mix, I/O sizes and rates, as well as the semantics of the I/O that is issued. For example: are writes O_SYNC or O_DSYNC? Is direct I/O used? fio is an amazing tool for performing I/O simulations and tests; it can reproduce most application workloads, has good platform support, and has an active development and user community.
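As a sketch of what such a simulation might look like, here is a hypothetical fio job file approximating an 8k random-I/O OLTP-style workload with direct, synchronous I/O semantics; the directory, size, runtime, and 70/30 read/write mix are assumptions to replace with your application's actual profile:

```ini
; Hypothetical fio job: 8k random reads/writes with direct, synchronous I/O,
; similar in spirit to an Oracle database using filesystemio_options=SETALL.
[global]
directory=/mnt/nfs/fiotest
size=100g
bs=8k
direct=1
ioengine=libaio
iodepth=16
runtime=600
time_based

[oltp-mix]
rw=randrw
rwmixread=70
sync=1
```

Run it with fio jobfile.fio; fio reports completion-latency percentiles and histograms, not just averages.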
When measuring I/O performance, be sure to use histograms to evaluate latencies rather than looking just at averages. Histograms reveal the existence of anomalies, and can clarify the presence of caches as well as the actual I/O performance the disks are delivering. See, for example, these two images from an actual test on a customer system:
First a sequential read I/O test was run, followed by a random read I/O test. If we looked only at averages, we would have seen a sequential read latency of around 4 ms, quite good. Looking at the histogram distribution, however, it is clear we are getting a mix of 10-20 ms disk latencies and 0.5-2 ms latencies, presumably from cache. In the subsequent random I/O test we see the improbably good performance, with an average of 1 ms and I/Os ranging as low as 100 microseconds. Clearly our working set has been mostly cached here; we can see the few actual random read disk accesses that are occurring by looking at the small bar in the 10-20 ms range. Without histograms it would be easy to mistakenly conclude that the storage was screaming fast and that we would never see latencies over 5 ms. For this reason, at Delphix, we use the 95th percentile latency as the guideline for how storage is responding to our tests. Again, fio is an excellent tool for I/O testing that reports latency histogram information.
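To see why the 95th percentile is more informative than the average, here is a toy computation with made-up latencies: eight cached reads around half a millisecond plus two real disk reads. The average hides the disk reads; the 95th percentile exposes them.

```shell
# Samples in ms: 8 cache hits (~0.5) and 2 disk reads (15 and 18).
# awk computes the mean and picks the 95th-percentile sample after sorting.
printf '%s\n' 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.6 15 18 |
  sort -n |
  awk '{ a[NR] = $1; sum += $1 }
       END { printf "avg=%.1f ms p95=%.1f ms\n", sum/NR, a[int(NR*0.95)] }'
```

The average of 3.7 ms suggests decent storage; the 95th percentile of 15 ms tells the truth about what the disks are delivering.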
Run tests multiple times to verify reproducibility. If your second and third test runs show different results than the first, none is a good basis for forecasting eventual application performance. If performance increases on later tests, most likely your data is becoming cached. If performance moves up and down, most likely you are on shared infrastructure.
Since shared infrastructure is common, run at least one test at the same time as when the key workload will run. On shared infrastructure it is important to test at the time when the actual application performance will matter. For example, test during the peak load times of your application, not overnight or on a weekend. As an aside, I am occasionally misinformed by customers that the infrastructure is not shared, only to learn later that it is.
For read tests, ensure your test data size is comparable with the size of the eventual application data – or at least much larger than any intervening caches. For example, if you are developing a 1 TB OLTP system try to test over 1 TB of data files. Typical SANs have order 10 GB of cache shared among multiple users. Many operating systems (notably Linux and Solaris) will tend to use all available system RAM as a read cache. This suggests 100 GB would be the absolute minimum test data size that wouldn’t see substantial caching.
Run tests for long enough so that ramp up or dynamic workload detection changes don’t contribute substantially to your result. In practice, I find an adequate duration by running the test workloads over and over while progressively doubling the duration until I get two runs whose performance is within a close margin of one another.
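That doubling procedure can be sketched as a loop. Here run_test is an invented stub standing in for your real benchmark command, returning made-up throughput numbers that ramp up and then stabilize:

```shell
# Double the test duration until two consecutive runs agree within 10%.
run_test() {   # stub: replace with your real benchmark (e.g. an fio run)
  case $1 in
    30) echo 80 ;; 60) echo 120 ;; 120) echo 150 ;; *) echo 155 ;;
  esac
}
dur=30; prev=0
while :; do
  cur=$(run_test $dur)
  # stop once cur is within +/-10% of the previous run's result
  if [ "$prev" -gt 0 ] && [ $(( (cur - prev) * 100 / prev )) -le 10 ] \
     && [ $(( (prev - cur) * 100 / prev )) -le 10 ]; then
    break
  fi
  prev=$cur; dur=$((dur * 2))
done
echo "stable at ${dur}s: $cur"
```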
Initialize your test files with data similar to what your application will use. This avoids first write penalties in the test, and ensures your tests are consistent with application performance when the storage uses compression.
Architects must often forecast the I/O performance an application will receive from existing storage. Making an accurate forecast is surprisingly tricky. If you are trying to use a test workload to evaluate storage, there is a risk that the test will trigger some anomaly that makes the results invalid. If you are lucky, these invalid results will be clearly impossible and lead you to do more testing; if you are unlucky, they will appear reasonable and problems will arise during the production roll out of the application.
An awareness of common root causes of I/O performance anomalies, and some rules of thumb for avoiding them while testing, can improve the accuracy of a performance forecast and reduce risks to an application roll out.
Delphix allows companies to make copies of Oracle databases in minutes with almost no storage overhead by sharing all the duplicate data blocks between database copies. Sharing duplicate data blocks massively reduces storage consumption, frees DBAs from the repetitive work of copying files (allowing them to concentrate on innovative work), and, most importantly, accelerates project development by eliminating the time- and resource-consuming bottleneck of provisioning databases for development, QA, and reporting. With Delphix, databases can be provisioned in minutes to any point in time from the source database, allowing one to easily go back to before an incident on production to retrieve data, or to stand up a full copy of production from a time when production had a problem, allowing easy analysis of intermittent issues.
A database refresh in 15 minutes?
That is mind blowing! Delphix nailed it for us.
- Matt Lawrence, Sr. Director, Wind River (Intel)
Here is an example of virtualizing 1 TB in 120 seconds by provisioning five 200 GB databases in Delphix.
In the demo I click on the source database in the top left. A source database is represented by the S in a black circle. After clicking on the source, I see the source snapshots appear in the middle screen. I take the default, which is the most recent snapshot. I then click on "Provision" in the bottom middle. I then choose the target machine that has the Oracle binaries and will run the clone instance. After that I accept the defaults and it creates the virtual database. I then do that four more times, creating a total of five database clones of a 200GB database. For physical copies this would take considerable time, space, and expertise.
Lots of good presentations, discussions, and interactions with people at RMOUG. Kudos goes out to the whole RMOUG team for putting on an awesome conference. Thanks to Tim Gorman, Kellyn Pot’vin, and all the dedicated volunteers.
My favorite presentation was by Jordan Meyer of Rittman Mead consulting. He did social media analysis with R. His R source code is available at
He took as an example the speakers from RMOUG and looked at their relationships on Twitter. Using this data he illustrated different network relationships such as
nodes and edges
directed and undirected
diameter and shortest path
Clustering Coefficient (Density)
Here are the speakers and their connections on Twitter
Here are the speakers and their “Betweenness Centrality”
Here are the speakers that are the core of the community
Here are the speakers that are the more relaxed core of the community
I also enjoyed Martin Widlake’s presentation on “The First Five Things to Know About Exadata”, Karl Arao’s talk on “A Consolidation Success Story”, and Frits Hoogland’s presentation on “About Multiblock Read”.
Frits showed some cool ways of tracing processes and the surprising fact that, in some cases, Oracle doesn’t record a wait event when the wait is fast enough. This applies to direct path reads in particular.
Martin’s presentation was a great overview of Exadata from the perspective of someone who knows Oracle well but has never worked on an Exadata.
Between Martin’s talk and discussions with many people there on Exadata, it seems clear to me that Exadata is great for data warehouses but can actually hurt OLTP performance. OLTP apparently runs better, not to mention cheaper, on ODA.
I spent much of my time at the Delphix booth, which was great fun. People seemed to get the value of Delphix immediately from one sentence: “Provision database clones in 2 minutes with almost no storage overhead by sharing duplicate blocks across all database copies.” We gave two talks on database virtualization and both generated a flood of people visiting our booth. We scanned almost 200 people at the booth, about 25% of the conference.
I’ve mentioned Lab128 before but I have to mention it again. Every DBA should know about this tool. It’s the fastest way to connect to an Oracle database and get hard-core performance information, nicely visualized graphically.
Lab128 starts up in seconds, like 1 or 2 seconds. It’s super fast and lightweight, but powerful. My older version took only 10 MB of RAM. The newest version is about 47 MB but still as fast as ever.
Lab128 can be used to monitor 15-20 instances. It keeps 13-14 hours of performance data in memory, and the last 60 days on disk. Monitoring that many instances for that long takes around 800-900 MB of RAM and well under 10% of the CPU on a laptop.
Below is my favorite screen, the “Activity Explorer” page, which is somewhat like the Top Activity screen in OEM but shows both the data from ASH and the data from wait events, so you can compare the two. The ASH data can be collected by the tool itself, avoiding the diagnostics and performance pack license, or, if you have the license, it can connect to ASH from the database views.
You can monitor several databases at a time
Here is the top session screen, showing session state, session type, and, for active queries, the estimated completion percentage of the query.
Jonathan Lewis has graciously accepted an offer to come out to sunny California next month and spend a few days at Delphix! Jonathan will be putting Delphix through its paces. I’m super excited to have Jonathan test Delphix and then learn about what he discovers. We are planning to co-present a webinar on the findings, and Jonathan will be blogging his findings as well.
If you have questions for Jonathan about Delphix, or scenarios you’d like him to test out, ask him by commenting on his blog post on Delphix:
As a performance architect at Delphix for the past two years, I’ve been involved in all sorts of Delphix performance-related work, benchmarks, and tests. I’ve spent much of that time personally pounding on Delphix, and I am convinced that the technology is rock solid, fast, and agile. I can’t understand why every Oracle shop doesn’t already have Delphix in place. It’s incredible. Without Delphix, the cloning process is like dragging huge weights around. With Delphix, cloning is fast and light, like having enormous power at one’s fingertips.
Every so often a DSS query that usually takes 10 minutes ends up taking over an hour (or one that usually takes an hour never seems to finish).
Why would this happen?
When investigating the DSS query, perhaps with wait event tracing, one finds that the query, which is doing full table scans and should be doing large multi-block reads and waiting for “db file scattered read”, is instead waiting for single-block reads, i.e. “db file sequential read”. What the heck is going on?
Single-block sequential reads during a full table scan query that should be doing scattered reads are a classic sign of reading rollback, and reading rollback can make a full table scan that normally takes minutes take hours.
What can happen, especially with overnight jobs, is this: an overnight job fails to finish before the DSS query is run, and if that job does massive updates without committing until the end, the DSS query will have to roll back the changes those updates made to the tables the DSS query is accessing.
How do we quickly identify whether this is our issue?
ASH is good at identifying it. On the other hand, it’s often impractical to whip up an ASH query from scratch, and that’s where ashmasters on GitHub comes in. This ASH query and others are available on GitHub under ashmasters.
Existing database cloning technologies allow increased development output, fewer bugs in production, and reduced DBA workload. Database virtualization, built upon these technologies, can greatly increase these gains. In this paper we’ll examine the history of using database clones to improve application development and the technical advances of thin provisioned clones and ultimately database virtualization that allow massive gains in productivity.
Oracle estimates that customers deploy, on average, 12 clones of production databases to non-production environments. These database clones are used to support the software development lifecycle: developing new functionality, and testing new versions of applications in quality assurance (QA) and user acceptance testing (UAT) prior to production. The clones are also used for reporting and ad hoc information queries. Further, Oracle predicts this average will double by the time Oracle 12c is adopted.* Today, most cloning is accomplished by creating full physical copies of production databases. These full physical copies are time consuming to make, require significant DBA time and storage space, and generally lead to project delays.

Development demands preclude organizations from working directly with the production database. Development of new versions of applications must be performed in a sandbox where schema changes and data additions, subtractions, and manipulations can be performed without affecting business continuity. After development, QA and UAT testing must be done on a system that matches the development specifications, along with suitable data. Finally, ad hoc and reporting queries can have unexpected resource consumption which negatively affects performance on production systems.

Development and QA processes can further exacerbate the need for copies. Developers generally work on separate branches of code, which can have associated requirements for database schema changes or specific datasets. If developers share a database copy, the job falls to the developers to make sure they approve any changes and that these changes are compatible with what everyone else is working on. This process of approving changes alone can take weeks, and adds much more time debugging when data or schema changes break others’ code. Ideally, developers would operate in a sandbox with their own copy of the production test database.
QA teams generally run multiple regression test suites, validating that newly developed functionality works and that existing functionality hasn’t broken. When working with a single copy of a production database, this puts QA in a bind: they either have to run all test suites simultaneously or run them serially. When the test suites are run simultaneously, teams run the risk of compromising the results as data are modified by multiple independent tests. Test suites can instead be run serially, refreshing the database copy after each test, but at a massive hit to productivity. Much as with development, the ideal scenario is a production clone for each test suite. As an example scenario, a customer with a 1 terabyte database, 100 developers, and 20 test suites would need close to 130 production database copies (one database copy per developer and per test suite, plus a few extra for branching, merging, ad hoc queries, and reporting). Understandably, very few companies have the resources (DBA time, storage) to provision these, let alone keep them refreshed for the duration of the project. Given the high demand for clones of production databases, companies and DBAs often struggle to keep up and must make sacrifices in quality or quantity. The compromises reached are generally fewer, shared databases, partial subset databases, or a mixture of both.
Development productivity gains, reduction of production bugs, and DBA time savings have been available without extra licenses through little-known functionality in Oracle since version 22.214.171.124. Even greater productivity gains are available with industry-leading technologies supporting additional versions of Oracle and other leading databases. These technologies enable productivity gains by reducing the workload and resources required to provision multiple copies of production databases. In our previous example, creating 130 copies of a 1TB database is easily possible in the space of a single copy of the production database using thin provision cloning.

Thin provision cloning gives enormous disk savings by sharing the majority of the source database’s data blocks. A large portion of database blocks across multiple copies of a database remain the same, so thin provision cloning allows those unchanged blocks to be shared between different clones. This technology ultimately led to database virtualization, which goes beyond thin clone provisioning to dramatically reduce the overhead of managing many cloned databases, providing significant agility to development teams.

Database virtualization is based on the core technology of thin provision cloning, which provides clones of production databases in less space and time than making full physical copies. Database virtualization evolves this technology to provide specific management controls, allowing virtual databases to be created, refreshed, rolled back, cloned, branched, and deleted in minutes. Virtual databases can be provisioned from any point in time (down to the second) within the source database’s retention window. This functionality allows each developer and each QA test suite to have their own full copy of a production database. Further, developers and testers can have access to weeks’ worth of backup databases in the space of a single backup.
These backups can be brought online in minutes, the data reviewed or extracted, and the copy removed, all in minutes. Database virtualization lets DBAs stop making compromises: they can provide any number of databases without worrying about the scope of the effort or the space required, and developers and testers can ensure significantly higher quality with more complete data. To recap, the three industry technologies available for making clones are:
Full physical clone
Thin provisioned clone
Database virtualization
Next we’ll describe how each of these technologies solves the problems presented by creating copies of production databases, and the benefits that each evolutionary step provides.
Each of the technologies follows an evolutionary path – full physical clones, thin provision clones, and database virtualization all offer the ability to create multiple copies of production databases, but where they differ is in implementation feasibility and automation.
Full Physical Clone
Full physical clones are the classic way to make copies of production databases for non-production environments. Full copies are just that – an entirely new instance of a database, separate from the production systems. These clones are time consuming, resource intensive, and space consuming. On average, the time to create a full physical clone is about two weeks from initial request to usable database instance. To DBAs the core issue is clear – significant work and time is invested to make exact copies, even though the majority of the data blocks are identical and will remain so. Further, the work done by DBAs to create the database copies is immediately out of date, and there is no easy management solution for maintaining, refreshing, or modifying these clones. Database copies can be created, but significant effort is required from the DBA, development and QA teams to work around the limitations of the system.
Thin Provisioned Cloning
Thin provisioned cloning was the first technology to address the issue of storing large numbers of identical data blocks. Thin provisioning introduces a new layer over a copy of a source database. Each clone has a separate thin layer where it maintains its changes to the central copy, which remains unchanged. As each clone has a separate thin layer that only it can see, each has the appearance of being a full physical copy of the source database. Thin provisioning can eliminate much of the space demand of database copies, reducing the associated storage cost of non-production database copies. There are three categories of thin provisioning technology:
Single point in time
Multiple but limited points in time
Multiple but limited points in time in a rolling window
Single Point in Time
Single point in time thin provision cloning is the simplest thin provisioning technology, but the least flexible. It takes a full database backup at a point in time and allows multiple clones to open this backup. The technical innovation is allowing each clone to write any changes to a private area; each clone thus shares the majority of data blocks with the other clones, but the private change area makes it appear to each clone as if it has a full-size read/write copy of the database. The downside to this technology is that it does not account for database refreshes – any time a clone requires a newer version of the source database, an entire new copy of the source database has to be made. Further, it is only appropriate for situations in which high performance is not a key requirement, as it is notably slower than its physical counterparts. Finally, there is significant scripting required and limited documentation available, meaning that the onus is on the DBA to manage and own the environment. Oracle first offered this technology in an obscure feature called DBclone; however, it has performance and management overhead even in limited use and is not appropriate for enterprise-level development.
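The private change area is essentially copy-on-write. As a rough sketch (the directory layout and the read_block helper are illustrative inventions, not any vendor’s implementation), a clone’s read checks its private area first and falls back to the shared base image:

```shell
#!/bin/sh
# Copy-on-write sketch: all clones share one base image; each clone's
# writes land in its own private area, and reads prefer the private copy.
BASE=/tmp/cow_base          # shared backup image (illustrative path)
PRIV=/tmp/cow_clone1        # clone 1's private change area
rm -rf "$BASE" "$PRIV"; mkdir -p "$BASE" "$PRIV"
echo "base version of block 7" > "$BASE/block7"

read_block() {              # read_block <block> <private_dir>
    if [ -f "$2/$1" ]; then cat "$2/$1"; else cat "$BASE/$1"; fi
}

read_block block7 "$PRIV"                           # served from shared base
echo "clone 1 version of block 7" > "$PRIV/block7"  # write goes to private area
read_block block7 "$PRIV"                           # now served from private copy
```

Every clone sees a full read/write database while the base image stays unchanged, which is why the space cost per clone is only its changed blocks.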
Multiple limited clone versions
To address the issue of database refreshes, EMC and Fujitsu offer thin provisioned cloning technology which allows sharing data blocks across multiple versions of the source database. This technology is based on file systems that can take point-in-time snapshots. A point-in-time snapshot can be cloned to provide a private read/write version of that file system. As changes come into the file system from the source database, new file system snapshots and clones can be created, allowing multiple point-in-time views of the database. Unfortunately, after a limited number of snapshots (generally around ten), the system has to be rebuilt, requiring a complete new copy of the original database. In addition to periodic rebuilds, these systems also incur major performance hits. The performance hits can be so serious on VMware’s Data Director linked clone technology that VMware recommends against using it for Oracle databases.
Continuous data versions
NetApp offers the ability to not only snapshot and then create clones from the snapshots but also drop any blocks from the original snapshot that are no longer needed, allowing a continuous rolling window of snapshots from the source database. Custom retention windows can be set up – new data blocks are constantly added and old data blocks dropped. As an example, if a two week retention window was desired, the system could snapshot the source database once a day and clones could share snapshots anywhere in that two week window. Blocks particular to snapshots falling outside of the two week time window could be dropped, thus allowing the system to run continuously without requiring rebuilds. While this offers quite a bit of functionality not possible with other thin provisioned clones, there are a number of serious downsides that prevent most enterprises from deploying it.
Hardware Lock-in: To provision this functionality NetApp requires buying specialized hardware which requires unique administration. Administrators using this functionality with the NetApp hardware are required to write custom scripts to set up the system.
LUN-Level Snapshots: NetApp works on LUNs, taking snapshots and making clones of the full LUN as opposed to the datafiles. As it works at the LUN level, it cannot detect any corruption in the datafiles that would otherwise be found when taking the backups through RMAN APIs.
Custom Scripting: Custom scripting is required to make the original database backup and keep the backup updated with changes from the source database.
Clone Creation: NetApp doesn’t supply any functionality to actually provision the clone databases, and clones can only be made from snapshots.
Clone Flexibility: As clones can only be made from snapshots, a number of key use cases cannot be accomplished – clones can’t be created from an arbitrary timestamp, can’t be rolled back, and can’t be branched.
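The rolling retention window described above can be pictured with a small sketch (the 14-day window and the retention_action helper are assumptions for illustration; real systems such as NetApp’s do this at the storage layer):

```shell
#!/bin/sh
# Rolling retention sketch: snapshots inside the window are kept and can
# back clones; snapshots that age out are dropped, freeing their unique blocks.
RETENTION_DAYS=14

retention_action() {        # retention_action <age_in_days> -> keep|drop
    if [ "$1" -le "$RETENTION_DAYS" ]; then echo keep; else echo drop; fi
}

for age in 1 7 13 14 15 20; do
    echo "snapshot aged ${age}d: $(retention_action $age)"
done
```

Because old blocks are continuously freed as snapshots age out, the system never needs the periodic full rebuilds that the limited-snapshot technologies require.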
Oracle’s ZFS storage appliance has a similar capability to NetApp’s but requires even more scripting and manual administration than NetApp, and thus has seen little to no uptake.
Thin provisioned clones have been around for almost two decades, yet they have seen very limited uptake due to the need for specialized hardware, expert knowledge, and scripting. These barriers to entry and the limited set of use cases have ensured that thin provisioned cloning remains an underutilized technology. Database virtualization was invented to take the benefits of thin provisioned clones, couple them with simple management, and provide significantly more data agility through on-demand database access. Database virtualization takes the core technology of thin provisioned cloning and extends it, providing the ability to:
Automate initial source database backup, snapshots, and redo log collection
Automate data retention, clearing out data older than the designated time window
Automate provisioning a clone from any SCN or second
Provision clones from multiple sources to the same point in time
Enable cloning of clones, branching clones, and rolling back clones
Efficiently store all the changes from source database
Run continually and automatically
End user virtual database provisioning
Easy enough to be run by non-DBA, non-sysadmin
Database virtualization technology allows virtual databases to be made in minutes, taking up almost no space, since a virtual database only creates new control files, redo log files and a new temporary tablespace. All the rest of the data is initially shared. This allows the following advantages:
Databases on demand
Higher quality testing
Databases on Demand
Virtual databases can be self provisioned in a matter of minutes, eliminating significant bureaucracy. Provisioning full physical copies can take weeks; virtual databases now take minutes, eliminating both the time to copy the production database data and all the time spent requesting, discussing, processing and allocating resources. When a developer needs a clone they typically have to ask their manager, DBA, storage admin, etc. The managerial decision-making process, administrative tasks and coordination meetings often take weeks. With database virtualization all of this overhead can be eliminated: the developer can provision their own virtual database in minutes, with no storage overhead.
As the resource and operational cost of providing database copies are eliminated with database virtualization, teams of developers can go from sharing one full physical production copy to each having their own private copy. With a private copy of the database, a developer can change schema and metadata as fast as they want instead of waiting days or weeks of review time to check in changes to a shared development database.
Higher quality testing
With as many virtual databases as needed, QA teams no longer need to rely on a single full copy of the source database on which to run tests. With a single database, QA teams often have to stop, refresh, and ensure they’re not overlapping tests. With database virtualization, QA can run many tests concurrently, and the virtual databases can be refreshed back to the original state in minutes, allowing immediate repeated replay of test suites, captured workloads and patch applications.
Database virtualization can dramatically reduce the amount of storage required for database copies. As the majority of the data blocks remain identical across copies, database virtualization only has to store the changed blocks, and even those can be compressed. Database virtualization not only saves disk space but can also save RAM. RAM on the virtual database hosts can be minimized because virtual databases share the same data files and can share the same blocks in the file system cache. No longer does each copy require private memory to cache the data.
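To make the disk savings concrete, here is back-of-envelope math using the 130-copies-of-1TB example from earlier (the ~2% change rate per clone and the thin_gb helper are assumptions for illustration, not measurements):

```shell
#!/bin/sh
# Storage estimate: full physical clones each copy the whole database;
# thin/virtual clones share one baseline and store only changed blocks.
SOURCE_GB=1024; CLONES=130

thin_gb() {                 # thin_gb <source_gb> <clones> <change_pct>
    awk -v s="$1" -v n="$2" -v c="$3" 'BEGIN { printf "%d", s + n*s*c/100 }'
}

echo "full clones: $((SOURCE_GB * CLONES)) GB"
echo "thin clones: $(thin_gb $SOURCE_GB $CLONES 2) GB (assuming ~2% changed per clone)"
```

Even before compression, the shared-block approach keeps 130 clones within a few multiples of a single copy rather than 130 of them.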
Database Virtualization Examples
The Delphix Server is a software stack that implements database virtualization using the Delphix file system (DxFS). The Delphix Server automates the process of database virtualization and management, and doesn’t require any specialized hardware. It only requires an x86 box to run the software and access to LUNs with about the same amount of the disk space of the database to be virtualized. The source database is backed up onto the Delphix virtual appliance via automated RMAN APIs, the data is compressed, Delphix automates syncing of the local copy with changes in production, freeing of data blocks outside the time retention window and Delphix handles the provisioning of virtual databases. A virtual database can be provisioned from any SCN or second in time during the retention window (typically two weeks).
Oracle is enabling database virtualization in Oracle 12c with the Snapshot Manager Utility (SMU), a separately licensed, pay-for software utility. The utility runs on the Oracle ZFS storage appliance, where the source database data files are stored.
Thin provision cloning has been around nearly two decades but has not been widely adopted due to the high barriers to entry. These barriers, including specialized hardware, consistent system rebuilds, specialized storage administrators, and custom scripting have led to the de facto solution being physical clones. Short of a more attractive option, companies have opted to create full or partial physical clones and deal with the ramifications of incomplete datasets, refresh difficulty, and concurrent use. With database virtualization, the hardware and management barriers have finally been eliminated allowing enterprises to offer significant database agility.
Here is a list of the technologies that can be used to create thin provision clones:
EMC – system rebuild issues after a few snapshots, hardware lock-in, requires advanced scripting, performance issues
Data Director (VMware) - system rebuild issues, performance issues, x86 databases only, officially not supported for thin provisioning cloning of Oracle databases
Oracle 12c Snapshot Manager Utility (SMU) – hardware lock-in, requires that the source database have its datafiles located on the Oracle ZFS Appliance
Delphix – automated solution for both administrator and end user. Delphix works for Oracle 9, 10 and 11 on RAC, Standard Edition and Enterprise Edition. Fully automated with time retention windows and end user self service provisioning. Also supports SQL Server databases. With Delphix there are no size restrictions, and clones and snapshots are unlimited. Snapshots can even be taken of snapshots, creating branched versions of source databases.
One thing that I have found sorely missing in the performance pages of Enterprise Manager is latency values for various types of I/O. The performance page or top activity may show high I/O waits, but it won’t indicate whether the latency of I/O is unusually high or not. Thus I put together a shell script that shows latency for the main I/O waits:
db file sequential read
db file scattered read
log file parallel write
direct path reads
direct path reads temp
Of course it would be nice to add a few others like direct path writes, direct path writes temp and log file sync but there is only so much room in the screen width.
The script is called oramon.sh and is available on github at
Usage: oramon.sh [username] [password] [host] [sid] <port=1521> <runtime=3600>
$ ./oramon.sh system sys 172.16.100.81 vsol
Connected, starting collect at Wed Apr 18 18:41:13 UTC 2012
starting stats collecting
single block     logfile write    multi block     direct read     direct read temp
ms      IOP/s    ms      IOP/s    ms      IOP/s   ms      IOP/s   ms      IOP/s
20.76   27.55    32.55   .71      3.50    .00     .00     .01     .00     .00
.00     .20      .00     .00      .00     .00
34.93   369.64   116.79  3.55     .00     .00     .00
31.43   640.33   92.40   8.33     .00     .00     .00
39.39   692.33   111.69  8.00     .00     .00     .00
The first line of output is the average since the database started up.
The subsequent lines are the averages over the interval since the previous line, which is 5 seconds by default.
One should be able to see immediately how much activity there is on the database and the latency for the basic types of database I/O.
Reads
Single block reads are the typical I/O from a database, occurring for example when reading a row in a table with indexes in place. Multi-block reads are common as well and happen, for example, when summing the values over all rows in a table. Direct reads are less common but quite normal; they happen almost exclusively for parallel query, though they may be used for other activities, especially in newer versions of Oracle such as 11.2. Direct reads are multiblock reads that bypass the Oracle buffer cache; their size varies from a single data block, such as 8k, up to 1MB. Direct read temp happens when a sort has overflowed memory limits and been written to disk; these are likewise multiblock reads that bypass the Oracle buffer cache, again varying in size from a single data block up to 1MB.
Writes
Logfile writes are in general the only writes that database users wait for. Strictly, users only wait when they commit, and that wait is for a signal from the log writer that their particular redo data is on disk, which could have already happened. Typically the user wait time is a bit slower than the log write time, but in general it’s close, i.e. within a few milliseconds. The farther apart the user wait time is from the log write time, the more likely there is a CPU, paging or other concurrency problem on the VDB host slowing down the user’s signalling and wake-up time.
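That rule of thumb can be expressed as a simple check (the sample latencies and the gap_ms helper are made up for illustration):

```shell
#!/bin/sh
# Compare the user-side commit wait ('log file sync') with the log
# writer's I/O time ('log file parallel write'); a gap of more than a few
# ms points at CPU, paging or other contention rather than slow I/O.
gap_ms() {                  # gap_ms <log_file_sync_ms> <log_file_parallel_write_ms>
    awk -v s="$1" -v w="$2" 'BEGIN { printf "%.2f", s - w }'
}

sync_ms=9.20; lfpw_ms=3.10
echo "user commit wait exceeds log write by $(gap_ms $sync_ms $lfpw_ms) ms"
```

A gap of several milliseconds, as in this invented sample, would warrant checking the VDB host for CPU or paging pressure before blaming storage.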
oramon.sql : Oracle Latency Query
If for some reason the shell script isn’t able to connect to the database, the same data can be collected by running the SQL queries by hand in SQL*Plus.
The following two SQL queries, oramon_setup.sql and oramon.sql, are available on github at
If you want to see the latencies over periods shorter than 60s, then you have to collect the values of the cumulative counters at time A, then again at time B, and take the difference.
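That A/B differencing is the heart of the calculation. A minimal sketch (the sample counter values are invented, and avg_ms is a hypothetical helper mirroring what oramon.sh computes with the prev* substitution variables set up below):

```shell
#!/bin/sh
# Interval latency from two cumulative samples of a v$ wait counter.
# time_waited is in centiseconds, so multiply by 10 to get milliseconds.
avg_ms() {                  # avg_ms <tm_A_cs> <ct_A> <tm_B_cs> <ct_B>
    awk -v ta="$1" -v ca="$2" -v tb="$3" -v cb="$4" \
        'BEGIN { if (cb > ca) printf "%.2f", 10 * (tb - ta) / (cb - ca) }'
}

# e.g. 1500 cs over 2000 waits at time A; 1850 cs over 2100 waits at time B:
echo "avg single block read: $(avg_ms 1500 2000 1850 2100) ms"
```

The same arithmetic applies to each wait event column in the oramon.sh output.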
Run oramon_setup.sql *once*
column seq_ms for 9999.99
column seq_ct for 9999.99
column lfpw_ms for 9999.99
column lfpw_ct for 9999.99
column scat_ms for 9999.99
column scat_ct for 9999.99
column dpr_ms for 9999.99
column dpr_ct for 9999.99
column dprt_ms for 9999.99
column dprt_ct for 9999.99
column prevdprt_ct new_value prevdprt_ct_var
column prevdprt_tm new_value prevdprt_tm_var
column prevdpwt_ct new_value prevdpwt_ct_var
column prevdpwt_tm new_value prevdpwt_tm_var
column prevdpr_ct new_value prevdpr_ct_var
column prevdpr_tm new_value prevdpr_tm_var
column prevdpw_ct new_value prevdpw_ct_var
column prevdpw_tm new_value prevdpw_tm_var
column prevseq_ct new_value prevseq_ct_var
column prevseq_tm new_value prevseq_tm_var
column prevscat_ct new_value prevscat_ct_var
column prevscat_tm new_value prevscat_tm_var
column prevlfpw_ct new_value prevlfpw_ct_var
column prevlfpw_tm new_value prevlfpw_tm_var
column prevsec new_value prevsec_var
select 0 prevsec from dual;
select 0 prevseq_tm from dual;
select 0 prevseq_ct from dual;
select 0 prevscat_ct from dual;
select 0 prevscat_tm from dual;
select 0 prevlfpw_ct from dual;
select 0 prevlfpw_tm from dual;
select 0 prevdprt_ct from dual;
select 0 prevdprt_tm from dual;
select 0 prevdpwt_ct from dual;
select 0 prevdpwt_tm from dual;
select 0 prevdpr_ct from dual;
select 0 prevdpr_tm from dual;
select 0 prevdpw_ct from dual;
select 0 prevdpw_tm from dual;
column prevdprt_ct noprint
column prevdprt_tm noprint
column prevdpwt_ct noprint
column prevdpwt_tm noprint
column prevdpr_ct noprint
column prevdpr_tm noprint
column prevdpw_ct noprint
column prevdpw_tm noprint
column prevseq_ct noprint
column prevseq_tm noprint
column prevscat_ct noprint
column prevscat_tm noprint
column prevlfpw_ct noprint
column prevlfpw_tm noprint
column prevsec noprint
Run the following query to see the current I/O latencies.
NOTE: the following is a simpler query, but the data only updates once a minute:
select n.name event,
       nvl(round(10*m.time_waited/nullif(m.wait_count,0),3) ,0) avg_ms
from   v$eventmetric m,
       v$event_name  n
where  m.event_id = n.event_id
  and  ( n.wait_class_id = 1740759767   -- User I/O
     or  n.wait_class_id = 4108307767 ) -- System I/O
  and  m.wait_count > 0 ;