
TCP throughput strangeness

May 16th, 2011

I don’t have an explanation for the following TCP performance difference between two machines, but I thought I’d post what I’ve seen because I think it’s interesting, and maybe someone else has ideas.
I’m running netio (http://freshmeat.net/projects/netio/) on one machine (OpenSolaris) and contacting two different Linux machines (both on kernel 2.6.18-128.el5), machine A and machine B.
Machine A gets a network throughput of 10 MB/sec with netio and machine B gets 100 MB/sec. Netio is set to send 32K chunks:

Linux machine: netio -s -b 32k -t -p 1234
Opensolaris: netio -b 32k -t -p 1234 linuxmachine

On the OpenSolaris machine I used DTrace to trace the connections. From DTrace, all the TCP settings look the same: same window sizes on the receive and send sides, same ssthresh, same congestion window sizes. But the slow machine is sending an ACK for every 2 or 3 segments it receives, whereas the fast machine is sending an ACK every 12 segments.
All three machines are on the same switch.
Here is the DTrace output.
Fast machine:

delta send   recd
 (us) bytes  bytes  swnd snd_ws   rwnd rcv_ws   cwnd    ssthresh
  122 1448 \      195200      7 131768      2 128872  1073725440
   37 1448 \      195200      7 131768      2 128872  1073725440
   20 1448 \      195200      7 131768      2 128872  1073725440
   18 1448 \      195200      7 131768      2 128872  1073725440
   18 1448 \      195200      7 131768      2 128872  1073725440
   18 1448 \      195200      7 131768      2 128872  1073725440
   18 1448 \      195200      7 131768      2 128872  1073725440
   19 1448 \      195200      7 131768      2 128872  1073725440
   18 1448 \      195200      7 131768      2 128872  1073725440
   18 1448 \      195200      7 131768      2 128872  1073725440
   57 1448 \      195200      7 131768      2 128872  1073725440
  171 1448 \      195200      7 131768      2 128872  1073725440
   29  912 \      195200      7 131768      2 128872  1073725440
   30      /    0 195200      7 131768      2 128872  1073725440

Slow machine:

delta send   recd
 (us) bytes  bytes  swnd snd_ws   rwnd rcv_ws   cwnd    ssthresh
  161      /    0 195200     7 131768      2 127424   1073725440
   52 1448 \      195200     7 131768      2 128872   1073725440
   33 1448 \      195200     7 131768      2 128872   1073725440
   11 1448 \      195200     7 131768      2 128872   1073725440
  143      /    0 195200     7 131768      2 128872   1073725440
   46 1448 \      195200     7 131768      2 130320   1073725440
   31 1448 \      195200     7 131768      2 130320   1073725440
   11 1448 \      195200     7 131768      2 130320   1073725440
  157      /    0 195200     7 131768      2 130320   1073725440
   46 1448 \      195200     7 131768      2 131768   1073725440
   18 1448 \      195200     7 131768      2 131768   1073725440
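
The ACK pattern described above can be counted directly from the two excerpts. The following is a minimal sketch, assuming that a line containing a backslash is a send and a line containing a forward slash is a receive, as in the output shown. The function name sends_per_ack is made up for this illustration, and the count for the fast machine includes the final 912-byte segment.

```python
def sends_per_ack(trace_lines):
    """Count segments sent between consecutive received ACKs.

    Assumption: a backslash marks a send line and a forward slash marks a
    receive line, as in the trace output above. A receive with no preceding
    send in the excerpt is ignored, and sends after the last receive are
    not counted.
    """
    counts = []
    pending = 0
    for line in trace_lines:
        if "\\" in line:
            pending += 1
        elif "/" in line:
            if pending:
                counts.append(pending)
            pending = 0
    return counts


# Excerpts copied from the trace output above.
fast_trace = r"""
  122 1448 \      195200      7 131768      2 128872  1073725440
   37 1448 \      195200      7 131768      2 128872  1073725440
   20 1448 \      195200      7 131768      2 128872  1073725440
   18 1448 \      195200      7 131768      2 128872  1073725440
   18 1448 \      195200      7 131768      2 128872  1073725440
   18 1448 \      195200      7 131768      2 128872  1073725440
   18 1448 \      195200      7 131768      2 128872  1073725440
   19 1448 \      195200      7 131768      2 128872  1073725440
   18 1448 \      195200      7 131768      2 128872  1073725440
   18 1448 \      195200      7 131768      2 128872  1073725440
   57 1448 \      195200      7 131768      2 128872  1073725440
  171 1448 \      195200      7 131768      2 128872  1073725440
   29  912 \      195200      7 131768      2 128872  1073725440
   30      /    0 195200      7 131768      2 128872  1073725440
""".splitlines()

slow_trace = r"""
  161      /    0 195200     7 131768      2 127424   1073725440
   52 1448 \      195200     7 131768      2 128872   1073725440
   33 1448 \      195200     7 131768      2 128872   1073725440
   11 1448 \      195200     7 131768      2 128872   1073725440
  143      /    0 195200     7 131768      2 128872   1073725440
   46 1448 \      195200     7 131768      2 130320   1073725440
   31 1448 \      195200     7 131768      2 130320   1073725440
   11 1448 \      195200     7 131768      2 130320   1073725440
  157      /    0 195200     7 131768      2 130320   1073725440
   46 1448 \      195200     7 131768      2 131768   1073725440
   18 1448 \      195200     7 131768      2 131768   1073725440
""".splitlines()

print("fast:", sends_per_ack(fast_trace))
print("slow:", sends_per_ack(slow_trace))
```

These excerpts are short, so the counts only illustrate the pattern; a longer capture would be needed to say anything statistical.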

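DTrace code used to produce the traces above (when run, dtrace also reported "dtrace: 130717 drops on CPU 0", meaning some trace records were dropped):
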
#!/usr/sbin/dtrace -s
#pragma D option quiet
#pragma D option defaultargs
inline int TICKS=$1;
inline string ADDR=$$2;
dtrace:::BEGIN
{
       TIMER = ( TICKS != NULL ) ?  TICKS : 1 ;
       ticks = TIMER;
       TITLE = 10;
       title = 0;
       walltime=timestamp;
       printf("starting up ...\n");
}
tcp:::send
/     ( args[2]->ip_daddr == ADDR || ADDR == NULL ) /
{
    nfs[args[1]->cs_cid]=1; /* remember this connection ID so tcp:::receive only reports connections we have sent on */
    delta= timestamp-walltime;
    walltime=timestamp;
    printf("%6d %8d \\ %8s  %8d %8d %8d  %8d %8d %12d %12d %12d %8d %8d  %d  \n",
        delta/1000,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        args[3]->tcps_sack_fack,
        args[3]->tcps_sack_snxt,
        args[3]->tcps_rto,
        args[3]->tcps_mss,
        args[3]->tcps_retransmit
      );
    flag=0;
    title--;
}
tcp:::receive
/ ( args[2]->ip_saddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] /
{
      delta=timestamp-walltime;
      walltime=timestamp;

      printf("%6d %8s / %8d  %8d %8d %8d  %8d %8d %12d %12d %12d %8d %8d  %d  \n",
        delta/1000,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        args[3]->tcps_sack_fack,
        args[3]->tcps_sack_snxt,
        args[3]->tcps_rto,
        args[3]->tcps_mss,
        args[3]->tcps_retransmit
      );
    flag=0;
    title--;
}

Follow-up: since I did the above, I have changed the DTrace code to include the number of unacknowledged bytes. It turns out the slow machine does run up its unacknowledged bytes until it hits the congestion window, whereas the fast machine never hits its congestion window:

unack    unack    delta  bytes   bytes       send   receive  cong       ssthresh
bytes    bytes     us     sent   received    window window    window
sent     received
139760      0     31     1448 \             195200  131768   144800   1073725440
139760      0     33     1448 \             195200  131768   144800   1073725440
144104      0     29     1448 \             195200  131768   146248   1073725440
145552      0     31          / 0           195200  131768   144800   1073725440
145552      0     41     1448 \             195200  131768   147696   1073725440
147000      0     30          / 0           195200  131768   144800   1073725440
147000      0     22     1448 \             195200  131768    76744        72400
147000      0     28          / 0           195200  131768    76744        72400
147000      0     18     1448 \             195200  131768    76744        72400
147000      0     26          / 0           195200  131768    76744        72400
147000      0     17     1448 \             195200  131768    76744        72400
147000      0     27          / 0           195200  131768    76744        72400
147000      0     18     1448 \             195200  131768    76744        72400
147000      0     56          / 0           195200  131768    76744        72400
147000      0     22     1448 \             195200  131768    76744        72400

dtrace code:

#!/usr/sbin/dtrace -s
#pragma D option quiet
#pragma D option defaultargs
inline int TICKS=$1;
inline string ADDR=$$2;
tcp:::send
/     ( args[2]->ip_daddr == ADDR || ADDR == NULL ) /
{
    nfs[args[1]->cs_cid]=1; /* remember this connection ID so tcp:::receive only reports connections we have sent on */
    delta= timestamp-walltime;
    walltime=timestamp;
    printf("%6d %6d %6d %8d \\ %8s  %8d %8d %8d  %8d %8d %12d %12d %12d %8d %8d  %d  \n",
        args[3]->tcps_snxt - args[3]->tcps_suna ,
        args[3]->tcps_rnxt - args[3]->tcps_rack,
        delta/1000,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        args[3]->tcps_sack_fack,
        args[3]->tcps_sack_snxt,
        args[3]->tcps_rto,
        args[3]->tcps_mss,
        args[3]->tcps_retransmit
      );
}
tcp:::receive
/ ( args[2]->ip_saddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] /
{
      delta=timestamp-walltime;
      walltime=timestamp;
      printf("%6d %6d %6d %8s / %-8d  %8d %8d %8d  %8d %8d %12d %12d %12d %8d %8d  %d  \n",
        args[3]->tcps_snxt - args[3]->tcps_suna ,
        args[3]->tcps_rnxt - args[3]->tcps_rack,
        delta/1000,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        args[3]->tcps_sack_fack,
        args[3]->tcps_sack_snxt,
        args[3]->tcps_rto,
        args[3]->tcps_mss,
        args[3]->tcps_retransmit
      );
}

So the slow machine appeared to acknowledge every second or third send because the receiver was already behind on acknowledging previous packets.
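
One more thing the follow-up trace allows is a quick arithmetic check on the point where cwnd and ssthresh change. The sketch below assumes an MSS of 1448 bytes (the send size seen in the trace; the tcps_mss column is not shown in the output) and uses a made-up helper name, reduction_summary. The new ssthresh (72400) is exactly half of the earlier cwnd (144800), and the new cwnd (76744) is ssthresh plus three segments. That is the shape of the fast retransmit reduction described in RFC 5681, which would suggest the sender saw duplicate ACKs. This is only an inference from the numbers, not something the trace itself confirms.

```python
def reduction_summary(cwnd_before, cwnd_after, ssthresh_after, mss):
    """Compare a cwnd/ssthresh change against a halve-and-inflate pattern.

    Returns whether ssthresh is exactly half the previous cwnd, and how many
    MSS-sized segments the new cwnd sits above the new ssthresh.
    """
    return {
        "ssthresh_is_half_cwnd": ssthresh_after * 2 == cwnd_before,
        "segments_above_ssthresh": (cwnd_after - ssthresh_after) / mss,
    }


# Values read from the follow-up trace above; MSS of 1448 is an assumption
# based on the send sizes in that trace.
summary = reduction_summary(
    cwnd_before=144800,
    cwnd_after=76744,
    ssthresh_after=72400,
    mss=1448,
)
print(summary)
```

If this reading is right, the retransmit counter (tcps_retransmit, already printed by the script but not shown in the output) would be worth checking on the slow connection.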

The remaining question is why the receiver got so far behind on one machine and not the other.

I’ve checked the rmem values on both machines and set them to the same values:

net.core.rmem_default=4194304

net.core.rmem_max=4194304
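
When comparing kernel settings across two machines, it can help to diff them mechanically rather than by eye. Here is a minimal sketch, assuming the settings are available as key=value text (for example, saved sysctl output). The helper names parse_sysctl and diff_settings are made up for this illustration. The text for machine B is a stand-in: it repeats the values above because both machines were set to the same values.

```python
def parse_sysctl(text):
    """Parse 'key=value' or 'key = value' lines into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def diff_settings(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Machine A values are the ones listed above; machine B is a stand-in copy.
machine_a = parse_sysctl("""
net.core.rmem_default=4194304
net.core.rmem_max=4194304
""")
machine_b = parse_sysctl("""
net.core.rmem_default = 4194304
net.core.rmem_max = 4194304
""")

print(diff_settings(machine_a, machine_b))
```

An empty result means the two sets of settings match. Feeding in the full settings from each machine would show any other differences between them.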

To be continued

Kyle Hailey