Getting Started with DTrace
Structure of a DTrace script
```
#!/usr/sbin/dtrace -s

something_to_trace
/ filters /
{
    actions
}

something_else_to_trace
/ filters_optional /
{
    take_some_actions
}
```
Each section fires when the specified probe fires in the OS. For example, if I do a send over TCP, then my “something_to_trace” could be a probe (an event) called “tcp:::send”. I could further filter by the receiving machine’s IP address. Then, when a packet is sent over TCP and the receiver is the IP in the filter, I can take some actions, such as printing the size of the packet.
What can I trace? What are the possible “something_to_trace” values, i.e. the probes?
To get a list of OS probes, run
```
dtrace -l
   ID   PROVIDER            MODULE                          FUNCTION NAME
```
```
sudo dtrace -ln tcp:::
   ID   PROVIDER            MODULE                          FUNCTION NAME
 7301        tcp                ip                    tcp_input_data receive
 7302        tcp                ip                tcp_input_listener receive
 7303        tcp                ip          tcp_xmit_listeners_reset receive
 7304        tcp                ip                   tcp_fuse_output receive
```
```
dtrace -lvn tcp:ip:tcp_input_data:receive
   ID   PROVIDER            MODULE                          FUNCTION NAME
 7301        tcp                ip                    tcp_input_data receive

        Argument Types
                args[0]: pktinfo_t *
                args[1]: csinfo_t *
                args[2]: ipinfo_t *
                args[3]: tcpsinfo_t *
                args[4]: tcpinfo_t *
```
(One can also look in /usr/lib/dtrace or, previously, at http://cvs.opensolaris.org/source/.)
For example, I see that args[3] is a “tcpsinfo_t”. What is “tcpsinfo_t”? I can type “tcpsinfo_t” into the Symbol field at
http://src.illumos.org
and get a list of appearances of this structure. I click on the first one, as it looks like it is probably the structure definition.
Now I can see the contents of the structure and the type of each field in it. From this I see the receiver’s address is “tcps_raddr” in args[3], so I can now filter on the receiver’s address.
A good example of trying to understand dtrace probes and arguments is found here: http://blogs.oracle.com/observatory/entry/d_script_archeology
And now I’m ready to put together a script that accesses the contents of these structures:
```
#!/usr/sbin/dtrace -s

#pragma D option quiet

tcp:::send
/ args[3]->tcps_raddr == "192.168.1.1" /
{
    printf("send    %6d\n", args[2]->ip_plength);
}

tcp:::receive
/ args[3]->tcps_raddr == "192.168.1.1" /
{
    printf("receive %6d\n", args[2]->ip_plength);
}
```
(I added the line “#pragma D option quiet” to suppress the default output of the names of the probes that fired and limit the output to the print statements only.)
The main point of this blog is to show how to get a list of OS probes to trace and, for those probes, how to find out the contents of the data structures they make available. Access to these structures depends on “translators.” See
http://www.solarisinternals.com/wiki/index.php/DTrace_Topics_Features
The translators are defined in /usr/lib/dtrace, so instead of looking up structures in the Illumos source one can look for them in the translator files there. The translators provide a stable way to access the OS structures. The OS structures can be accessed directly without the translators, but that is more complicated and not stable across versions of the OS, not to mention across different operating systems.
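To make the idea concrete, here is a minimal sketch of what a translator declaration looks like in D. The type and member names below are invented purely for illustration; the real TCP translators (for tcpsinfo_t and friends) live in /usr/lib/dtrace/tcp.d.
```
/* Purely illustrative: invented structures showing the translator mechanism.
   The real translators are in /usr/lib/dtrace, e.g. tcp.d. */

typedef struct raw_conn {             /* stands in for an unstable kernel structure */
    uint16_t rc_lport;
    uint16_t rc_fport;
} raw_conn_t;

typedef struct stable_conninfo {      /* the stable view that scripts are meant to use */
    uint16_t ci_local_port;
    uint16_t ci_remote_port;
} stable_conninfo_t;

/* the translator maps the raw structure to the stable one, member by member */
translator stable_conninfo_t < raw_conn_t *R > {
    ci_local_port  = R->rc_lport;
    ci_remote_port = R->rc_fport;
};
```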
For Oracle, there are no translators, though you could write them.
To get an idea of what you can trace in Oracle, get the PID of a running Oracle process, say 11602 in my case, and run
```
dtrace -ln pid11602:oracle:kcbr*:
```
The list will be quite large; it more or less requires knowing what you are looking for. Above I give “kcbr*” for Kernel Cache Buffer Recent. Check out http://www.orafaq.com/wiki/X$_Table for an idea of the layers that make up the start of some of the function names. (It would be a lot more fun to trace Oracle with DTrace if one had access to the source!) A simple starting point is just counting how often those functions get called, as in the sketch below.
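Here is a minimal sketch of such a script (the script name and PID 11602 are mine; run it as, say, `dtrace -s kcbr.d -p 11602`):
```
#!/usr/sbin/dtrace -s

#pragma D option quiet

/* count entries into Oracle kcbr* functions of the process given with -p */
pid$target:oracle:kcbr*:entry
{
    @calls[probefunc] = count();
}

/* print the counts after 10 seconds and exit */
profile:::tick-10sec
{
    printa("%-40s %@d\n", @calls);
    exit(0);
}
```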
The best references for DTrace are, first, the DTrace book and the DTrace community, and then Brendan Gregg’s blog at:
- http://dtrace.org/blogs/brendan/
If you are specifically interested in TCP, then Alan Maguire’s blog is good:
- http://blogs.oracle.com/amaguire/
And for some specific examples with Oracle check out:
- http://alexanderanokhin.wordpress.com/2011/11/13/dynamic-tracing-of-oracle-logical-io/
- http://andreynikolaev.wordpress.com/2010/10/28/appetizer-for-dtrace/
- http://blog.tanelpoder.com/2009/04/24/tracing-oracle-sql-plan-execution-with-dtrace/
- http://www.slideshare.net/dougburns/dtrace-miracle-scotland-database-forum
A few other notes:
Useful variables that can be filtered on or printed out (a short example follows this list):
- pid – process id
- execname – executable name
- timestamp – timestamp in nanoseconds
- tid – thread id
- cwd – current working directory
- probeprov, probemod, probefunc, probename – the probe’s provider, module, function, and name
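As a quick sketch of how these built-in variables are used (the “oracle” executable name is just an example filter):
```
#!/usr/sbin/dtrace -s

#pragma D option quiet

/* print a few built-in variables each time a process named "oracle" enters read() */
syscall::read:entry
/ execname == "oracle" /
{
    printf("%s pid=%d tid=%d probe=%s:%s ts=%d\n",
           execname, pid, tid, probefunc, probename, timestamp);
}
```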
DTrace variables (from Brendan’s blog, see http://dtrace.org/blogs/brendan/2011/11/25/dtrace-variable-types/):
DTrace variable types, in order of preference:
| type | prefix | scope | overhead | multi-CPU safe | example assignment |
|---|---|---|---|---|---|
| aggregation | @ | global | low | yes | @x = count(); |
| clause local | this-> | clause instance[1] | very low | yes | this->x = 1; |
| thread local | self-> | thread | medium | yes | self->x = 1; |
| scalar | none | global | low-medium | no[2] | x = 1; |
| associative array | none | global | medium-high | no[2] | x[y] = 1; |
The above list is a nice cheat sheet. I started off using scalar global variables, but the right way is to use aggregations, clause-local, and thread-local variables.
Definitely read about aggregations when getting started with DTrace, as they are key to getting useful data out of it:
http://wikis.sun.com/display/DTrace/Aggregations
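Here is a small sketch (mine, not from Brendan’s post) that uses all three of the preferred variable types to measure read() latency: a thread-local to carry the start time from entry to return, a clause-local for scratch arithmetic, and aggregations to collect the results:
```
#!/usr/sbin/dtrace -s

#pragma D option quiet

syscall::read:entry
{
    self->ts = timestamp;                 /* thread-local: survives until the return probe */
}

syscall::read:return
/ self->ts /
{
    this->delta = timestamp - self->ts;   /* clause-local: scratch within this clause */
    @avg_ns[execname]  = avg(this->delta);        /* aggregations collect the results */
    @dist_ns[execname] = quantize(this->delta);
    self->ts = 0;                         /* always clear thread-locals when done */
}

profile:::tick-5sec
{
    printa("avg read latency for %s: %@d ns\n", @avg_ns);
    printa(@dist_ns);
    exit(0);
}
```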
For formatting output data, wrap scripts with something like Perl and post-process the data. Here is an example of piping DTrace data to Perl and having Perl format the output. (The Perl code is set up to aggregate similar types of data coming from I/O, ZFS, and NFS, but below is just the code for monitoring NFS.)
```
/usr/sbin/dtrace -n '
#pragma D option quiet
nfsv3:::op-read-start, nfsv3:::op-write-start, nfsv4:::op-read-start
{
    tm[args[1]->noi_xid] = timestamp;
    sz[args[1]->noi_xid] = args[2]->count;
}
nfsv4:::op-write-start
{
    tm[args[1]->noi_xid] = timestamp;
    sz[args[1]->noi_xid] = args[2]->data_len;
}
nfsv3:::op-read-done, nfsv3:::op-write-done,
nfsv4:::op-read-done, nfsv4:::op-write-done
/tm[args[1]->noi_xid]/
{
    this->delta = (timestamp - tm[args[1]->noi_xid]);
    this->type  = probename == "op-write-done" ? "W" : "R";
    @nfs_tm[this->type] = sum(this->delta);
    @nfs_mx[this->type] = max(this->type == "R" ? this->delta : 0);
    @nfs_mx[this->type] = max(this->type == "W" ? this->delta : 0);
    @nfs_ct[this->type] = count();
    @nfs_sz[this->type] = sum(sz[args[1]->noi_xid]);
    tm[args[1]->noi_xid] = 0;
    sz[args[1]->noi_xid] = 0;
}
profile:::tick-1sec
{
    printa("nfs_tm ,%s,%@d\n", @nfs_tm);
    printa("nfs_mx ,%s,%@d\n", @nfs_mx);
    printa("nfs_ct ,%s,%@d\n", @nfs_ct);
    printa("nfs_sz ,%s,%@d\n", @nfs_sz);
    clear(@nfs_mx);
    clear(@nfs_tm);
    clear(@nfs_ct);
    clear(@nfs_sz);
    printf("!\n");
}
' | perl -e '
while (my $line = <STDIN>) {
    $line =~ s/\s+//g;
    if ( $line eq "!" ) {
        printf("--|------| %10s %10s %10s %10s\n", "avg_ms", "MB/s", "mx_ms", "count");
        foreach $type ("R", "W") {      # double quotes so the shell single-quoting stays intact
            foreach $class ("nfs") {
                $ct = ${$class . "_ct"}{$type} || 0;
                $sz = ${$class . "_sz"}{$type} || 0;
                $mx = ${$class . "_mx"}{$type} || 0;
                $tm = ${$class . "_tm"}{$type} || 0;
                if ( $ct > 0 ) {
                    $ms = ($tm / 1000000) / $ct;
                    printf("$type | $class : %10.2f", $ms);
                } else {
                    printf("$type | $class : %10.2f", 0);
                }
                printf(" %10.2f", $sz / (1024 * 1024));
                $mx_ms = $mx / 1000000;
                printf(" %10.2f", $mx_ms);
                printf(" %10d", $ct);
                print "\n";
            }
        }
    } else {
        ($area, $r_w, $value) = split(",", $line);
        ${$area}{$r_w} = $value;
    }
}'
```
PS: I tried using AWK for formatting, but the data seems to get buffered so there isn’t immediate output. I tried using “fflush” and printing to /dev/stdout but couldn’t get it to work consistently:
```
dtrace [code] | awk '{
    # hack to get AWK to flush output immediately
    printf("%s\n", $0) > "/dev/stdout"
    close("/dev/stdout")
    # fflush(stdout)
    # system("")
}'
```
PS: one thing to keep in mind is overhead. The overhead of DTrace is small, but it’s not zero. See
http://dtrace.org/blogs/brendan/2011/02/18/dtrace-pid-provider-overhead/
for more information. From the above link from Brendan Gregg, here is the key part:
I’d suggest considering pid provider overhead as follows:
- Don’t worry too much about pid provider probe cost at < 1000 events/sec.
- At > 10,000 events/sec, pid provider probe cost will be noticeable.
- At > 100,000 events/sec, pid provider probe cost may be painful.
Let’s discuss these using the 6 us per probe result from the previous tests. This could be faster (~2 us) or slower (> 15 us) depending on architecture and the probe action.
- At 1,000 events/sec, that could cost 6 milliseconds of CPU time. If the application was 100% on-CPU, it may slow by 0.6% – ie, don’t worry.
- At 10,000 events/sec, that could cost 60 milliseconds of CPU time. For a 100% on-CPU application, then the slow down may be 6%. If it was idle some of the time (systems often are), then the overhead may be less, as there are spare cycles to run DTrace (really depends on the location of the probe in the code-path, though).
- At 100,000 events/sec, the CPU cost could be 600 milliseconds. This can significantly slow down an application. Whether that’s a problem depends on the app and its function – how sensitive it is to latency.
DTrace is quite terse and limited in actions and functions. As I said above, for formatting data, wrap DTrace in Perl and do calculations and formatting in Perl. Why? A few examples:
- DTrace doesn’t have floating point operations
- DTrace aggregate arrays don’t provide a way to query the keys and step through them, so there is no way to do calculations combining values in one aggregate array with values in another.
Just save yourself time and headache and do calculations and formatting in Perl.
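To illustrate the first point: with no floating point, fractional output has to be faked with scaled integers inside D, something like the sketch below, and that is exactly the kind of fiddling that is easier to push out to Perl.
```
#!/usr/sbin/dtrace -s

#pragma D option quiet

syscall::write:entry
{
    self->ts = timestamp;
}

/* D has no floating point, so print milliseconds with three decimal places
   by splitting a microsecond value into whole and fractional parts */
syscall::write:return
/ self->ts /
{
    this->us = (timestamp - self->ts) / 1000;
    printf("write took %d.%03d ms\n", this->us / 1000, this->us % 1000);
    self->ts = 0;
}
```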
DTrace is also limited in built-in functions, and many of them are hard to track down and/or have little documentation. Here are lists of functions:
http://www.solarisinternals.com/wiki/index.php/DTrace_Topics_Internals
http://developers.sun.com/solaris/articles/dtrace_quickref/dtrace_functions.html