Getting Started with DTrace
Structure of a DTrace script
```
#!/usr/sbin/dtrace -s

something_to_trace
/ filters /
{
    actions
}

something_else_to_trace
/ filters_optional /
{
    take_some_actions
}
```
Each section fires when the specified probe fires in the OS. For example, if I do a send over TCP, then my “something_to_trace” could be a probe (an event) called “tcp:::send”. I could further filter by the receiving machine’s IP address. Then, when a packet is sent over TCP and the receiver is the IP in the filter, I can take some actions, such as printing the size of the packet.
What can I trace? What are the possible “something_to_trace” values, i.e. the probes?
To get a list of OS probes, run
```
dtrace -l
   ID   PROVIDER            MODULE                          FUNCTION NAME
```
```
sudo dtrace -ln tcp:::
   ID   PROVIDER            MODULE                          FUNCTION NAME
 7301        tcp                ip                    tcp_input_data receive
 7302        tcp                ip                tcp_input_listener receive
 7303        tcp                ip          tcp_xmit_listeners_reset receive
 7304        tcp                ip                   tcp_fuse_output receive
```
```
dtrace -lvn tcp:ip:tcp_input_data:receive
   ID   PROVIDER            MODULE                          FUNCTION NAME
 7301        tcp                ip                    tcp_input_data receive

        Argument Types
                args[0]: pktinfo_t *
                args[1]: csinfo_t *
                args[2]: ipinfo_t *
                args[3]: tcpsinfo_t *
                args[4]: tcpinfo_t *
```
(One can also look in /usr/lib/dtrace or, previously, at http://cvs.opensolaris.org/source/.)
For example, I see that args[3] is a “tcpsinfo_t”. What is “tcpsinfo_t”? I can type “tcpsinfo_t” into the Symbol field at
http://src.illumos.org
and get a list of appearances of this structure. I click on the first one, as it looks like it is probably the structure definition.
Now I can see the contents of the structure and the type of each field in it. From this I see the receiver’s address is “tcps_raddr” in args[3], so I can now filter on the receiver’s address.
A good example of trying to understand dtrace probes and arguments is found here: http://blogs.oracle.com/observatory/entry/d_script_archeology
And now I’m ready to put together a script that accesses the contents of these structures:
```
#!/usr/sbin/dtrace -s

#pragma D option quiet

tcp:::send
/ args[3]->tcps_raddr == "192.168.1.1" /
{
    printf("send    %6d\n", args[2]->ip_plength);
}

tcp:::receive
/ args[3]->tcps_raddr == "192.168.1.1" /
{
    printf("receive %6d\n", args[2]->ip_plength);
}
```
(I added the line “#pragma D option quiet” to suppress the default output of the names of the probes that fired and limit the output to the print statements only.)
The main point of this blog is to show how to get a list of OS probes to trace and, for those probes, how to find out the contents of the data structures they make available. Access to these structures depends on “translators.” See
http://www.solarisinternals.com/wiki/index.php/DTrace_Topics_Features
The translators are defined in /usr/lib/dtrace, so instead of looking up structures in the Illumos source one can look for them in the translator files there. The translators provide a stable way to access the OS structures. The OS structures can be accessed directly without the translators, but that is more complicated and not stable across versions of the OS, not to mention across different operating systems.
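To make the idea concrete, here is a minimal sketch of what a translator declaration looks like in D. The type and member names below are invented purely for illustration; the real TCP translators (for tcpsinfo_t and friends) live in /usr/lib/dtrace/tcp.d.
```
/* Purely illustrative: invented structures showing the translator mechanism.
   The real translators are in /usr/lib/dtrace, e.g. tcp.d. */

typedef struct raw_conn {             /* stands in for an unstable kernel structure */
    uint16_t rc_lport;
    uint16_t rc_fport;
} raw_conn_t;

typedef struct stable_conninfo {      /* the stable view that scripts are meant to use */
    uint16_t ci_local_port;
    uint16_t ci_remote_port;
} stable_conninfo_t;

/* the translator maps the raw structure to the stable one, member by member */
translator stable_conninfo_t < raw_conn_t *R > {
    ci_local_port  = R->rc_lport;
    ci_remote_port = R->rc_fport;
};
```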
For Oracle, there are no translators, though you could write them.
To get an idea of what you can trace in Oracle, get the PID of a running Oracle process, say 11602 in my case, and run
```
dtrace -ln pid11602:oracle:kcbr*:
```
The list will be quite large; it more or less requires knowing what you are looking for. Above I give “kcbr*” for Kernel Cache Buffer Recent. Check out http://www.orafaq.com/wiki/X$_Table for an idea of the layers that make up the start of some of the function names. (It would be a lot more fun to trace Oracle with DTrace if one had access to the source!) A simple starting point is just counting how often those functions get called, as in the sketch below.
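Here is a minimal sketch of such a script (the script name and PID 11602 are mine; run it as, say, `dtrace -s kcbr.d -p 11602`):
```
#!/usr/sbin/dtrace -s

#pragma D option quiet

/* count entries into Oracle kcbr* functions of the process given with -p */
pid$target:oracle:kcbr*:entry
{
    @calls[probefunc] = count();
}

/* print the counts after 10 seconds and exit */
profile:::tick-10sec
{
    printa("%-40s %@d\n", @calls);
    exit(0);
}
```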
The best references for DTrace are, first, the DTrace book and the DTrace community, and then Brendan Gregg’s blog at:
- http://dtrace.org/blogs/brendan/
If you are specifically interested in TCP, then Alan Maguire’s blog is good:
- http://blogs.oracle.com/amaguire/
And for some specific examples with Oracle check out:
- http://alexanderanokhin.wordpress.com/2011/11/13/dynamic-tracing-of-oracle-logical-io/
- http://andreynikolaev.wordpress.com/2010/10/28/appetizer-for-dtrace/
- http://blog.tanelpoder.com/2009/04/24/tracing-oracle-sql-plan-execution-with-dtrace/
- http://www.slideshare.net/dougburns/dtrace-miracle-scotland-database-forum
A few other notes:
Useful variables that can be filtered on or printed out (a short example follows this list):
- pid – process id
- execname – executable name
- timestamp – timestamp in nanoseconds
- tid – thread id
- cwd – current working directory
- probeprov, probemod, probefunc, probename – the probe’s provider, module, function, and name
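As a quick sketch of how these built-in variables are used (the “oracle” executable name is just an example filter):
```
#!/usr/sbin/dtrace -s

#pragma D option quiet

/* print a few built-in variables each time a process named "oracle" enters read() */
syscall::read:entry
/ execname == "oracle" /
{
    printf("%s pid=%d tid=%d probe=%s:%s ts=%d\n",
           execname, pid, tid, probefunc, probename, timestamp);
}
```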
DTrace variables (from Brendan’s blog, see http://dtrace.org/blogs/brendan/2011/11/25/dtrace-variable-types/):
DTrace variable types, in order of preference:
| type | prefix | scope | overhead | multi-CPU safe | example assignment |
|---|---|---|---|---|---|
| aggregation | @ | global | low | yes | @x = count(); |
| clause local | this-> | clause instance[1] | very low | yes | this->x = 1; |
| thread local | self-> | thread | medium | yes | self->x = 1; |
| scalar | none | global | low-medium | no[2] | x = 1; |
| associative array | none | global | medium-high | no[2] | x[y] = 1; |
The above list is a nice cheat sheet. I started off using scalar global variables, but the right way is to use aggregations, clause-local, and thread-local variables.
Definitely read about aggregations when getting started with DTrace, as they are key to getting useful data out of it:
http://wikis.sun.com/display/DTrace/Aggregations
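Here is a small sketch (mine, not from Brendan’s post) that uses all three of the preferred variable types to measure read() latency: a thread-local to carry the start time from entry to return, a clause-local for scratch arithmetic, and aggregations to collect the results:
```
#!/usr/sbin/dtrace -s

#pragma D option quiet

syscall::read:entry
{
    self->ts = timestamp;                 /* thread-local: survives until the return probe */
}

syscall::read:return
/ self->ts /
{
    this->delta = timestamp - self->ts;   /* clause-local: scratch within this clause */
    @avg_ns[execname]  = avg(this->delta);        /* aggregations collect the results */
    @dist_ns[execname] = quantize(this->delta);
    self->ts = 0;                         /* always clear thread-locals when done */
}

profile:::tick-5sec
{
    printa("avg read latency for %s: %@d ns\n", @avg_ns);
    printa(@dist_ns);
    exit(0);
}
```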
For formatting output data, wrap scripts with something like Perl and post-process the data. Here is an example of piping DTrace data to Perl and having Perl format the output. (The Perl code is set up to aggregate similar types of data coming from I/O, ZFS, and NFS, but below is just the code for monitoring NFS.)
```
/usr/sbin/dtrace -n '
#pragma D option quiet
nfsv3:::op-read-start, nfsv3:::op-write-start, nfsv4:::op-read-start
{
    tm[args[1]->noi_xid] = timestamp;
    sz[args[1]->noi_xid] = args[2]->count;
}
nfsv4:::op-write-start
{
    tm[args[1]->noi_xid] = timestamp;
    sz[args[1]->noi_xid] = args[2]->data_len;
}
nfsv3:::op-read-done, nfsv3:::op-write-done,
nfsv4:::op-read-done, nfsv4:::op-write-done
/tm[args[1]->noi_xid]/
{
    this->delta = (timestamp - tm[args[1]->noi_xid]);
    this->type  = probename == "op-write-done" ? "W" : "R";
    @nfs_tm[this->type] = sum(this->delta);
    @nfs_mx[this->type] = max(this->type == "R" ? this->delta : 0);
    @nfs_mx[this->type] = max(this->type == "W" ? this->delta : 0);
    @nfs_ct[this->type] = count();
    @nfs_sz[this->type] = sum(sz[args[1]->noi_xid]);
    tm[args[1]->noi_xid] = 0;
    sz[args[1]->noi_xid] = 0;
}
profile:::tick-1sec
{
    printa("nfs_tm ,%s,%@d\n", @nfs_tm);
    printa("nfs_mx ,%s,%@d\n", @nfs_mx);
    printa("nfs_ct ,%s,%@d\n", @nfs_ct);
    printa("nfs_sz ,%s,%@d\n", @nfs_sz);
    clear(@nfs_mx);
    clear(@nfs_tm);
    clear(@nfs_ct);
    clear(@nfs_sz);
    printf("!\n");
}
' | perl -e '
while (my $line = <STDIN>) {
    $line =~ s/\s+//g;
    if ( $line eq "!" ) {
        printf("--|------| %10s %10s %10s %10s\n", "avg_ms", "MB/s", "mx_ms", "count");
        foreach $type ("R", "W") {      # double quotes so the shell single-quoting stays intact
            foreach $class ("nfs") {
                $ct = ${$class . "_ct"}{$type} || 0;
                $sz = ${$class . "_sz"}{$type} || 0;
                $mx = ${$class . "_mx"}{$type} || 0;
                $tm = ${$class . "_tm"}{$type} || 0;
                if ( $ct > 0 ) {
                    $ms = ($tm / 1000000) / $ct;
                    printf("$type | $class : %10.2f", $ms);
                } else {
                    printf("$type | $class : %10.2f", 0);
                }
                printf(" %10.2f", $sz / (1024 * 1024));
                $mx_ms = $mx / 1000000;
                printf(" %10.2f", $mx_ms);
                printf(" %10d", $ct);
                print "\n";
            }
        }
    } else {
        ($area, $r_w, $value) = split(",", $line);
        ${$area}{$r_w} = $value;
    }
}'
```
PS: I tried using AWK for formatting, but the data seems to get buffered so there isn’t immediate output. I tried using “fflush” and printing to /dev/stdout but couldn’t get it to work consistently:
```
dtrace [code] | awk '{
    # hack to get AWK to flush output immediately
    printf("%s\n", $0) > "/dev/stdout"
    close("/dev/stdout")
    # fflush(stdout)
    # system("")
}'
```
PS: one thing to keep in mind is overhead. The overhead of DTrace is small, but it’s not zero. See
http://dtrace.org/blogs/brendan/2011/02/18/dtrace-pid-provider-overhead/
for more information. From the above link from Brendan Gregg, here is the key part:
I’d suggest considering pid provider overhead as follows:
- Don’t worry too much about pid provider probe cost at < 1000 events/sec.
- At > 10,000 events/sec, pid provider probe cost will be noticeable.
- At > 100,000 events/sec, pid provider probe cost may be painful.
Let’s discuss these using the 6 us per probe result from the previous tests. This could be faster (~2 us) or slower (> 15 us) depending on architecture and the probe action.
- At 1,000 events/sec, that could cost 6 milliseconds of CPU time. If the application was 100% on-CPU, it may slow by 0.6% – ie, don’t worry.
- At 10,000 events/sec, that could cost 60 milliseconds of CPU time. For a 100% on-CPU application, then the slow down may be 6%. If it was idle some of the time (systems often are), then the overhead may be less, as there are spare cycles to run DTrace (really depends on the location of the probe in the code-path, though).
- At 100,000 events/sec, the CPU cost could be 600 milliseconds. This can significantly slow down an application. Whether that’s a problem depends on the app and its function – how sensitive it is to latency.
DTrace is quite terse and limited in actions and functions. As I said above, for formatting data, wrap DTrace in Perl and do calculations and formatting in Perl. Why? A few examples:
- DTrace doesn’t have floating point operations
- DTrace aggregate arrays don’t provide a way to query the keys and step through them, so there is no way to do calculations combining values in one aggregate array with values in another.
Just save yourself time and headache and do calculations and formatting in Perl.
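To illustrate the first point: with no floating point, fractional output has to be faked with scaled integers inside D, something like the sketch below, and that is exactly the kind of fiddling that is easier to push out to Perl.
```
#!/usr/sbin/dtrace -s

#pragma D option quiet

syscall::write:entry
{
    self->ts = timestamp;
}

/* D has no floating point, so print milliseconds with three decimal places
   by splitting a microsecond value into whole and fractional parts */
syscall::write:return
/ self->ts /
{
    this->us = (timestamp - self->ts) / 1000;
    printf("write took %d.%03d ms\n", this->us / 1000, this->us % 1000);
    self->ts = 0;
}
```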
DTrace is also limited in built-in functions, and many of them are hard to track down and/or have little documentation. Here are lists of functions:
http://www.solarisinternals.com/wiki/index.php/DTrace_Topics_Internals
http://developers.sun.com/solaris/articles/dtrace_quickref/dtrace_functions.html