TCP throughput strangeness
I don’t have an answer to the following TCP performance difference between two machines, but thought I’d post about what I’ve seen as I think it’s interesting, and maybe someone else has ideas.
I’m running netio (http://freshmeat.net/projects/netio/) on one machine (opensolaris) and contacting two different Linux machines (both on 2.6.18-128.el5 ), machine A and machine B.
Machine A has a network throughput of 10MB/sec with netio and machine B 100MB/sec with netio. Netio is set to send 32K chunks:
Linux machine: netio -s -b 32k -t -p 1234
Opensolaris: netio -b 32k -t -p 1234 linuxmachine
On the opensolaris machine I used dtrace to trace the connections. From dtrace, all the interactions TCP settings look the same – same windows sizes on the receive and send, same ssthresh, same congestion window sizes, but the slow machine is sending an ACK for every 2 or 3 receives whereas the fast machine is sending an ACK every 12 receives.
All three machines are on the same switch.
Here is the Dtrace output:
Fast Machine:
delta send recd (us) bytes bytes swnd snd_ws rwnd rcv_ws cwnd ssthresh 122 1448 \ 195200 7 131768 2 128872 1073725440 37 1448 \ 195200 7 131768 2 128872 1073725440 20 1448 \ 195200 7 131768 2 128872 1073725440 18 1448 \ 195200 7 131768 2 128872 1073725440 18 1448 \ 195200 7 131768 2 128872 1073725440 18 1448 \ 195200 7 131768 2 128872 1073725440 18 1448 \ 195200 7 131768 2 128872 1073725440 19 1448 \ 195200 7 131768 2 128872 1073725440 18 1448 \ 195200 7 131768 2 128872 1073725440 18 1448 \ 195200 7 131768 2 128872 1073725440 57 1448 \ 195200 7 131768 2 128872 1073725440 171 1448 \ 195200 7 131768 2 128872 1073725440 29 912 \ 195200 7 131768 2 128872 1073725440 30 / 0 195200 7 131768 2 128872 1073725440
slow machine:
delta send recd (us) bytes bytes swnd snd_ws rwnd rcv_ws cwnd ssthresh 161 / 0 195200 7 131768 2 127424 1073725440 52 1448 \ 195200 7 131768 2 128872 1073725440 33 1448 \ 195200 7 131768 2 128872 1073725440 11 1448 \ 195200 7 131768 2 128872 1073725440 143 / 0 195200 7 131768 2 128872 1073725440 46 1448 \ 195200 7 131768 2 130320 1073725440 31 1448 \ 195200 7 131768 2 130320 1073725440 11 1448 \ 195200 7 131768 2 130320 1073725440 157 / 0 195200 7 131768 2 130320 1073725440 46 1448 \ 195200 7 131768 2 131768 1073725440 18 1448 \ 195200 7 131768 2 131768 1073725440
Dtrace code
dtrace: 130717 drops on CPU 0 #!/usr/sbin/dtrace -s #pragma D option quiet #pragma D option defaultargs inline int TICKS=$1; inline string ADDR=$$2; dtrace:::BEGIN { TIMER = ( TICKS != NULL ) ? TICKS : 1 ; ticks = TIMER; TITLE = 10; title = 0; walltime=timestamp; printf("starting up ...\n"); } tcp:::send / ( args[2]->ip_daddr == ADDR || ADDR == NULL ) / { nfs[args[1]->cs_cid]=1; /* this is an NFS thread */ delta= timestamp-walltime; walltime=timestamp; printf("%6d %8d \ %8s %8d %8d %8d %8d %8d %12d %12d %12d %8d %8d %d \n", delta/1000, args[2]->ip_plength - args[4]->tcp_offset, "", args[3]->tcps_swnd, args[3]->tcps_snd_ws, args[3]->tcps_rwnd, args[3]->tcps_rcv_ws, args[3]->tcps_cwnd, args[3]->tcps_cwnd_ssthresh, args[3]->tcps_sack_fack, args[3]->tcps_sack_snxt, args[3]->tcps_rto, args[3]->tcps_mss, args[3]->tcps_retransmit ); flag=0; title--; } tcp:::receive / ( args[2]->ip_saddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] / { delta=timestamp-walltime; walltime=timestamp; printf("%6d %8s / %8d %8d %8d %8d %8d %8d %12d %12d %12d %8d %8d %d \n", delta/1000, "", args[2]->ip_plength - args[4]->tcp_offset, args[3]->tcps_swnd, args[3]->tcps_snd_ws, args[3]->tcps_rwnd, args[3]->tcps_rcv_ws, args[3]->tcps_cwnd, args[3]->tcps_cwnd_ssthresh, args[3]->tcps_sack_fack, args[3]->tcps_sack_snxt, args[3]->tcps_rto, args[3]->tcps_mss, args[3]->tcps_retransmit ); flag=0; title--; }
Followup , since I did the above, I have since changed the dtrace code to include the number of unacknowledged bytes and it turns out the slow code does run up it’s unacknowleged bytes until it hits the congestion window, where as the fast machine never hits it’s congestion window:
unack unack delta bytes bytes send receive cong ssthresh bytes byte us sent recieved window window window sent recieved 139760 0 31 1448 \ 195200 131768 144800 1073725440 139760 0 33 1448 \ 195200 131768 144800 1073725440 144104 0 29 1448 \ 195200 131768 146248 1073725440 145552 0 31 / 0 195200 131768 144800 1073725440 145552 0 41 1448 \ 195200 131768 147696 1073725440 147000 0 30 / 0 195200 131768 144800 1073725440 147000 0 22 1448 \ 195200 131768 76744 72400 147000 0 28 / 0 195200 131768 76744 72400 147000 0 18 1448 \ 195200 131768 76744 72400 147000 0 26 / 0 195200 131768 76744 72400 147000 0 17 1448 \ 195200 131768 76744 72400 147000 0 27 / 0 195200 131768 76744 72400 147000 0 18 1448 \ 195200 131768 76744 72400 147000 0 56 / 0 195200 131768 76744 72400 147000 0 22 1448 \ 195200 131768 76744 72400
dtrace code:
#!/usr/sbin/dtrace -s #pragma D option quiet #pragma D option defaultargs inline int TICKS=$1; inline string ADDR=$$2; tcp:::send, tcp:::receive / ( args[2]->ip_daddr == ADDR || ADDR == NULL ) / { nfs[args[1]->cs_cid]=1; /* this is an NFS thread */ delta= timestamp-walltime; walltime=timestamp; printf("%6d %6d %6d %8d \ %8s %8d %8d %8d %8d %8d %12d %12d %12d %8d %8d %d \n", args[3]->tcps_snxt - args[3]->tcps_suna , args[3]->tcps_rnxt - args[3]->tcps_rack, delta/1000, args[2]->ip_plength - args[4]->tcp_offset, "", args[3]->tcps_swnd, args[3]->tcps_snd_ws, args[3]->tcps_rwnd, args[3]->tcps_rcv_ws, args[3]->tcps_cwnd, args[3]->tcps_cwnd_ssthresh, args[3]->tcps_sack_fack, args[3]->tcps_sack_snxt, args[3]->tcps_rto, args[3]->tcps_mss, args[3]->tcps_retransmit ); } tcp:::receive / ( args[2]->ip_saddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] / { delta=timestamp-walltime; walltime=timestamp; printf("%6d %6d %6d %8s / %-8d %8d %8d %8d %8d %8d %12d %12d %12d %8d %8d %d \n", args[3]->tcps_snxt - args[3]->tcps_suna , args[3]->tcps_rnxt - args[3]->tcps_rack, delta/1000, "", args[2]->ip_plength - args[4]->tcp_offset, args[3]->tcps_swnd, args[3]->tcps_snd_ws, args[3]->tcps_rwnd, args[3]->tcps_rcv_ws, args[3]->tcps_cwnd, args[3]->tcps_cwnd_ssthresh, args[3]->tcps_sack_fack, args[3]->tcps_sack_snxt, args[3]->tcps_rto, args[3]->tcps_mss, args[3]->tcps_retransmit ); }
So the fact that it looked like the slow machine was acknowledging every second or third send was due to the fact that the receiver was already behind on acknowledging previous packets.
Now the question remains is why did the receiver get so far behind on one machine and not the other.
I’ve check the rmem values on both machines and set the to the same:
Trackbacks
Comments
Did you have a look at the output of ethtool on both linux machines? Maybe one is running on half duplex.
http://www.softpanorama.org/Net/Netutils/ethtool.shtml
@Oliver
@Oliver
Sure enough, one of the NICs had downgraded to 100MB !
Solved that issue.
Unfortunately the main machine I’m having issues with is AIX which is having similar issues and the NIC is running at 1GE
$ entstat -d en0 | grep -i speed
Media Speed Selected: Autonegotiate
Media Speed Running: 100 Mbps, Full Duplex
External-Network-Switch (ENS) Port Speed: 100 Mbps, Full Duplex
On this machine we are going to try and switch the NIC and look for possible driver upgrades.