DB Optimizer » TCP monitoring strangeness continued

TCP monitoring strangeness continued

May 25th, 2011

On
http://dboptimizer.com/2011/04/27/envisioning-nfs-performance/
I showed some code that displayed the send and receive sizes and times over TCP on Solaris 10 with the dtrace command (see tcp.d for the code). I took this code on another machine and got errors like

“dtrace: error on enabled probe ID 29 (ID 5461: tcp:ip:tcp_input_data:receive): invalid alignment (0xffffff516eb5e83a) in predicate at DIF offset 136”

Not quite sure why this was happening but by a process of elimination I found that accessing args[4] in tcp:::receive caused these errors.

Fortunately many of the values in args[4] are found in args[3] as well.

To find arguments to tcp:::receive , first run the following command (or alternatively look the TCP probes up on the wiki at http://wikis.sun.com/display/DTrace/tcp+Provider)

( for command line flags check out http://compute.cnr.berkeley.edu/cgi-bin/man-cgi?dtrace+7

-l = list instead of enable probes
-n = Specify probe name to trace or list
-v = Set verbose mode

i.e. list the verbose information about the probes named and don’t enable these probes, just list them)

$ dtrace -lvn tcp:::receive
 5473        tcp                ip                        tcp_output send

        Argument Types
                args[0]: pktinfo_t *
                args[1]: csinfo_t *
                args[2]: ipinfo_t *
                args[3]: tcpsinfo_t *
                args[4]: tcpinfo_t *

(by the way, there are a number of probes that match tcp:::receive, but they all have the same arguments, I didn’t notice all these different tcp:::receive until I actually ran the above command. Before running the command I’d depended on the wiki. Now I’m wondering what the difference is between some of these tcp:::receive and tcp:::send probes )

After finding the args for a probe, you can look up the definition of the structs at http://cvs.opensolaris.org/source/

tcp:::send and tcp:::receive both have the same arguments

args[3] is tcpsinfo_t
args[4] is tcpinfo_t

Looking up the structs at http://cvs.opensolaris.org/source/, I find the contents of the structs as follows:

tcpsinfo_t ( args[3] )

     111 typedef struct tcpsinfo {
    112         uintptr_t tcps_addr;
    113         int tcps_local;                 /* is delivered locally, boolean */
    114         int tcps_active;                /* active open (from here), boolean */
    115         uint16_t tcps_lport;            /* local port */
    116         uint16_t tcps_rport;            /* remote port */
    117         string tcps_laddr;              /* local address, as a string */
    118         string tcps_raddr;              /* remote address, as a string */
    119         int32_t tcps_state;             /* TCP state */
    120         uint32_t tcps_iss;              /* Initial sequence # sent */
    121         uint32_t tcps_suna;             /* sequence # sent but unacked */
    122         uint32_t tcps_snxt;             /* next sequence # to send */
    123         uint32_t tcps_rack;             /* sequence # we have acked */
    124         uint32_t tcps_rnxt;             /* next sequence # expected */
    125         uint32_t tcps_swnd;             /* send window size */
    126         int32_t tcps_snd_ws;            /* send window scaling */
    127         uint32_t tcps_rwnd;             /* receive window size */
    128         int32_t tcps_rcv_ws;            /* receive window scaling */
    129         uint32_t tcps_cwnd;             /* congestion window */
    130         uint32_t tcps_cwnd_ssthresh;    /* threshold for congestion avoidance */
    131         uint32_t tcps_sack_fack;        /* SACK sequence # we have acked */
    132         uint32_t tcps_sack_snxt;        /* next SACK seq # for retransmission */
    133         uint32_t tcps_rto;              /* round-trip timeout, msec */
    134         uint32_t tcps_mss;              /* max segment size */
    135         int tcps_retransmit;            /* retransmit send event, boolean */

tcpinfo_t (args[4])

     95 typedef struct tcpinfo {
     96         uint16_t tcp_sport;             /* source port */
     97         uint16_t tcp_dport;             /* destination port */
     98         uint32_t tcp_seq;               /* sequence number */
     99         uint32_t tcp_ack;               /* acknowledgment number */
    100         uint8_t tcp_offset;             /* data offset, in bytes */
    101         uint8_t tcp_flags;              /* flags */
    102         uint16_t tcp_window;            /* window size */
    103         uint16_t tcp_checksum;          /* checksum */
    104         uint16_t tcp_urgent;            /* urgent data pointer */
    105         tcph_t *tcp_hdr;                /* raw TCP header */
    106 } tcpinfo_t;

In the script I accessed the TCP seq and ack in arg[4], which was giving me errors in tcp:::receive, so I just switch these for the equivalents in arg[3]. Now I’m not exactly clear on the equivalence, but it seems

args[4]->tcp_seq ?= args[3]->tcps_suna
args[4]->tcp_ack ?= args[3]->tcps_rack

The ack=rack seems solid as tcps_rack = “highest sequence number for which we have received and sent an acknowledgement”.

The seq=tcps_suna is less clear to me as tcps_suna = “lowest sequence number for which we have sent data but not received acknowledgement”

But for my purposes, these distinctions might be unimportant as I’ve stopped looking at seq and ack and now look at outstanding unacknowledge bytes on the receiver and sender, which is

tcps_snxt-tcps_suna ” gives the number of bytes pending acknowledgement “
tcps_rsnxt – tcps_rack ” gives the number of bytes we have received but not yet acknowledged”

but more about that later. Let’s look at the new version of the program and the output

#!/usr/sbin/dtrace -s
#pragma D option defaultargs
#pragma D option quiet
inline string ADDR=$$1;
dtrace:::BEGIN
{       TITLE = 10;
       title = 0;
       walltime=timestamp;
       printf("starting up ...\n");
}
tcp:::send, tcp:::receive
/   title == 0 /
{   printf("%9s %8s %6s %8s %8s %12s %s \n",
        "delta"    ,
        "cid"    ,
        "pid"    ,
        "send" ,
        "recd"  ,
        "seq",
        "ack"
      );
     title=TITLE;
}
tcp:::send
/     ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) /
{    nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
    delta= timestamp-walltime;
    walltime=timestamp;
    printf("%9d %8d %6d %8d --> %8s %8d %8d %12s > %s \n",
        delta/1000,
        args[1]->cs_cid  % 100000,
        args[1]->cs_pid ,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        args[4]->tcp_seq  -
        args[3]->tcps_suna ,
        args[4]->tcp_ack -
        args[3]->tcps_rack ,
        args[3]->tcps_raddr,
        curpsinfo->pr_psargs
      );
    title--;
}
tcp:::receive
/ ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] /
{     delta=timestamp-walltime;
      walltime=timestamp;
      printf("%9d %8d %6d %8s <-- %-8d %8d %8d %12s < %s \n",
        delta/1000,
        args[1]->cs_cid  % 100000,
        args[1]->cs_pid ,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        args[3]->tcps_rack % 10000,
        args[3]->tcps_suna % 10000,
        args[3]->tcps_raddr,
        curpsinfo->pr_psargs
      );
    title--;
}

output

starting up ...
    delta      cid    pid     send     recd
      570     3904    845          <-- 0
       34     3904    845          <-- 140
      455     3904    845          <-- 0
       24     3904    845          <-- 0
  4789720     3904    845      124 -->
       82     3904    845      244 -->
       99     3904    845      132 -->
    delta      cid    pid     send     recd
       52     3904    845     1448 -->
       28     3904    845     1448 -->
       24     3904    845     1448 -->
       36     3904    845     1448 -->
       33     3904    845     1448 -->
       26     3904    845      952 -->
       86     3904    845      244 -->
      212     3904    845          <-- 140
      501     3904    845          <-- 132
       60     3904    845      124 -->
      256     3904    845          <-- 140
       72     3904    845          <-- 0
    39658     3904    845          <-- 0

What the heck is that huge time 4789720 us? ie 4 secs? The whole operation took me less than 1 second.
I wouldn’t have found the answer to this without help from Adam Levanthal. Turns out that output is in order only per CPU, but different CPUs can output data in different order. Each CPU buffers up data and then passed the buffer back to userland dtrace, so on a one CPU machine, this code will always output in order, but on a multi-cpu machine there is no guarentee on the order of the output. Lets add CPU # to the output:

#!/usr/sbin/dtrace -s
#pragma D option defaultargs
#pragma D option quiet
inline string ADDR=$$1;
dtrace:::BEGIN
{       TITLE = 10;
       title = 0;
       walltime=timestamp;
       printf("starting up ...\n");
}
tcp:::send, tcp:::receive
/   title == 0 /
{   printf("%9s %8s %6s %8s %8s %4s \n",
        "delta"    ,
        "cid"    ,
        "pid"    ,
        "send" ,
        "recd"  ,
        "cpu#"
      );
     title=TITLE;
}
tcp:::send
/     ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) /
{    nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
    delta= timestamp-walltime;
    walltime=timestamp;
    printf("%9d %8d %6d %8d --> %8s %d \n",
        delta/1000,
        args[1]->cs_cid  % 100000,
        args[1]->cs_pid ,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        cpu
      );
    title--;
}
tcp:::receive
/ ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] /
{     delta=timestamp-walltime;
      walltime=timestamp;
      printf("%9d %8d %6d %8s <-- %-8d %d \n",
        delta/1000,
        args[1]->cs_cid  % 100000,
        args[1]->cs_pid ,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        cpu
      );
    title--;
}

output

  delta      cid    pid     send     recd cpu#
       42     3904    845      244 -->          0
       66     3904    845      124 -->          0
     6091     3904    845      124 -->          2
       81     3904    845      244 -->          2
       96     3904    845      132 -->          2
       31     3904    845     1448 -->          2
       20     3904    845     1448 -->          2
       18     3904    845     1448 -->          2
       17     3904    845     1448 -->          2
       16     3904    845     1448 -->          2
  8910406     3904    845        0 -->          3
      375     3904    845          <-- 0        3
       24     3904    845          <-- 140      3
       34     3904    845        0 -->          3
      470     3904    845          <-- 140      3
      410     3904    845          <-- 132      3
    delta      cid    pid     send     recd cpu#
      491     3904    845          <-- 140      3
      393     3904    845          <-- 0        3
       15     3904    845      952 -->          3
       36     3904    845          <-- 0        3
    delta      cid    pid     send     recd cpu#
       19     3904    845          <-- 0        3
    40167     3904    845          <-- 0        3

what we see is the data ordered by CPU. In other words for each CPU the data is ordered but which CPU get’s printed first is unknown. In dtrace each CPU buffers up it’s data and then sends it to the userland dtrace process.
The only “fix” for now is to add a timestamp and order the data by timestamp.
Unsorted, it looks like:

607858102997348       281     3904    845      124 -->          2
607858103608856        84     3904    845      244 -->          2
607858104125414     delta      cid    pid     send     recd cpu#
607858104143731       116     3904    845      132 -->          2
607858104176769        33     3904    845     1448 -->          2
607858104198187        21     3904    845     1448 -->          2
607858104215813        17     3904    845     1448 -->          2
607858104233004        17     3904    845     1448 -->          2
607858104267629        34     3904    845     1448 -->          2
607858104287379        19     3904    845      952 -->          2
607858102716187  11569935     3904    845          <-- 132      3
607858103245377       248     3904    845          <-- 0        3
607858103282421        37     3904    845          <-- 140      3
607858103339076        56     3904    845      244 -->          3
607858103524093       185     3904    845          <-- 140      3
607858103774417       165     3904    845          <-- 132      3
607858103823145        48     3904    845      124 -->          3
607858104027216       204     3904    845          <-- 140      3
607858104387780       100     3904    845          <-- 0        3
607858104401487        13     3904    845          <-- 0        3
607858104520815       119     3904    845          <-- 0        3
607858144436175     delta      cid    pid     send     recd cpu#
607858144454625     39933     3904    845          <-- 0        3

sorted it looks like

607858102716187  11569935     3904    845          <-- 132      3
607858102997348       281     3904    845      124 -->          2
607858103245377       248     3904    845          <-- 0        3
607858103282421        37     3904    845          <-- 140      3
607858103339076        56     3904    845      244 -->          3
607858103524093       185     3904    845          <-- 140      3
607858103608856        84     3904    845      244 -->          2
607858103774417       165     3904    845          <-- 132      3
607858103823145        48     3904    845      124 -->          3
607858104027216       204     3904    845          <-- 140      3
607858104125414     delta      cid    pid     send     recd cpu#
607858104143731       116     3904    845      132 -->          2
607858104176769        33     3904    845     1448 -->          2
607858104198187        21     3904    845     1448 -->          2
607858104215813        17     3904    845     1448 -->          2
607858104233004        17     3904    845     1448 -->          2
607858104267629        34     3904    845     1448 -->          2
607858104287379        19     3904    845      952 -->          2
607858104387780       100     3904    845          <-- 0        3
607858104401487        13     3904    845          <-- 0        3
607858104520815       119     3904    845          <-- 0        3
607858144436175     delta      cid    pid     send     recd cpu#
607858144454625     39933     3904    845          <-- 0        3

so now the strange long time delta is at the beginning where I’d expect it.

I’m not quite sure how to deal with this. Post processing the data by sorting the timestamp column works, but interactively processing the data to get it in the right order as it comes out seems problematic.

Kyle Hailey
Uncategorized