TCP Trace Analysis for NFS

February 26th, 2013

Introduction

 

When there is a disparity in reported I/O latency between the NFS server and Oracle database on the NFS client the question arises:

Where does the difference in latency come from between NFS Server and Oracle’s timing of pread?

One strategy is to take tcp traces on NFS server and NFS client at the same time, during a period of load that shows the disparity.

The two trace files can then be analyzed to show latency at each side and the delta of latency between packets found in both traces

 

Files Required

 

Parsing script is ( thanks to Matt Amdur from Delphix for the  core code)

https://github.com/khailey/tcpdump/blob/master/parsetcp.pl

Requires tethereal which comes with the wireshark install on LINUX:

      yum install wireshark

Example situation

 

On NFS server the latency of 8K block sends was measured with DTrace and showed:

    avg ms
      30.2   
      24.4  
      21.9  
      23.7   
      32.0  
      25.5

On the NFS client, the Oracle database is measured with oramon.sh and shows the 8k read latency as

    avg ms
    126.15         
    103.91         
    117.33         
     96.46         
     78.43         
     91.94

There is a substantial difference between the latency seen by NFS server and the latency seen by Oracle. Where is the latency coming from? Is it a problem on the NFS server, the NFS client or the network? How do we find out? One strategy is to collect TCP dumps on the NFS client and NFS server and compare them.
Collecting TCP traces

First, set up a situation where there is a load running on the database and there is a discrepancy between latency on NFS server and as seen by Oracle. During this period of latency discrepancy collect TCP traces. The traces will be large, so run them for 20 second for example, to limit the size.

 

Collecting TCP dumps

TCP dumps can be collected on LINUX with tcpdump and on Solaris with snoop:

NFS server (Solaris NFS server example)

snoop -q -d aggr0 -s 512 -o nfs_server.cap 172.16.100.102
  • -d : device
  • -o : output file, give “.cap” extension to be recognized automatically by wireshark
  • -s : amount of packet to keep

NFS client host is the IP at the end of the command line

Linux (linux client example)

tcpdump -s 512  host 172.16.100.87 -w nfs_client.cap
  • -s : amount of packet to keep
  • -w : output file, give “.cap” extension to be recognized automatically by wireshark
  • host : IP of NFS server

Analyzing the Traces

 

The trace files can be analyzed on a linux machine with wireshark installed and with the script parsetcp.pl  on the two traces.

$ ./parsetcp.pl nfs_server.cap nfs_client.cap
 ==================== RAW DATA ============

Parsing NFS server trace: nfs_server.cap
type       avg ms count
   READ : 18.10,  12493  
  WRITE : 30.08,   4918   

Parsing NFS client trace: nfs_client.cap
type       avg ms count  
   READ : 18.74,  25974 
  WRITE : 30.05,  10197   

 ==================== MATCHED DATA  ============

READ
type       avg ms 
nfs server: 18.12
nfs_client: 19.26
     diff :  1.14 
Processed 14010 packets (Matched: 11964 Missed: 2046)

WRITE
type       avg ms  
 server : 30.86    
 client : 31.83
   diff :  0.96 
Processed 5806 packets (Matched: 4391 Missed: 1415)

The script will find the average latency for reads and writes on both ends and then try to match the packets found both at the client and server and report the average latency for reads and writes on the packets that match.
In the above output the read time on on the NFS client receiving end is on par with what is seen on NFS server.
Thus if the receive time on the NFS client host is on par with the NFS server and Oracle latency is much greater than the receive time, then there looks to be some issue on the NFS client host machine such as heavy memory paging that would cause the latency seen by Oracle to be much higher than that seen by the tcp traces.

 

Example problem analysis

 

In the following 2 examples a swingbench load was run on the Oracle database on the NFS client. The database files were mounted via NFS from the NFS server. In one case the NFS client and database host was Open Solaris and in the other case it was LINUX. In both cases the Oracle database reported much slower latency than the NFS server threads.

test 1 ) Open Solaris is the NFS client

 

Open Solaris is the NFS Server as well
NFS Server latency ( from DTrace of nfs start/done )

      avg_ms
       1.4    
       1.8  
       1.1  
       1.1  
       1.5  
       0.9   
       0.7    
       1.1  
       1.0    
       0.9  
       1.1

NFS client: the Oracle database shows read latency much slower at 29-65ms from oramon.sh:
single block

     avg ms     
     29.92          
     32.62       
     50.38        
     56.38            
     65.45          
     46.88         
     45.13       
     44.93            
     39.48            
     55.8

Test 1 TCP trace analysis

What do the tcp traces say? Tracing TCP on the NFS server side, and NFS client side indicates that the slow latencies are coming from the NFS server

 ==================== Individual HOST DATA ============
Parsing nfs server trace: nfs_server.cap
type       avg ms count   
   READ : 44.60,   7731    

Parsing client trace: client.cap
type       avg ms count    
   READ : 46.54,  15282    

 ==================== MATCHED DATA  ============
READ
type       avg ms   
 server : 48.39,   
 client : 49.42,   
   diff :  1.03,    
Processed 9647 packets (Matched: 5624 Missed: 4023)

TEST 1 summary:

The NFS Server says it is responding with 1ms latency (from DTrace) but the TCP traces show latency more on the order of 48ms which is in alignment with what Oracle sees on the NFS client, thus the NFS server must be introducing the large latency somehow and not the client.

 

Test 2) LINUX NFS client

Open Solaris is the NFS server

NFS Server 8K block latency (from DTrace)

     avg_ms 
       0.1    
       0.1   
       0.1   
       0.1   
       0.1   
       0.1

Oracle latency on LINUX from oramon.sh

   single block    
        ms      
     35.73           
     58.90          
     44.88         
     54.43         
     54.57          
     63.92

Again we see the same stats as the first test case, ie the NFS Server says it’s latency is low and the Oracle database on the NFS client says the latency is slow

TCP trace analysis

==================== Individual HOST DATA ============
Parsing NFS server trace: nfs_server.cap
type       avg ms count   
   READ :  1.17,   9042            

Parsing client trace: client.cap
type       avg ms count    
   READ :  1.49,  21984   

==================== MATCHED DATA  ============
READ
type       avg ms count    
 server :  1.03  
 client :  1.49
   diff :  0.46

Processed 13815 packets (Matched: 8169 Missed: 5646)
summary of test 2:

In this case the NFS Server TCP traces show fast latency as well as the NFS client TCP traces, thus the long latency times must be introduced on the client machine somehow and not the NFS Server.

 

Summary of test 1 and test 2

Tests are run with swingbench benchmark.
The Oracle databases are set with the parameter “filesystemio_options=SETALL” which enables direct I/O, so that all I/O requests from Oracle should be become NFS requests to NFS server.
Roughly the latency stats for 8k reads originating from the Oracle are

 

 Oracle on Linux  Oracle on Solaris  latency data source
 NFS Server  .1 ms   2 ms dtrace nfs:::op-read-start/op-read-done
 TCP trace NFS Server   1 ms  44 ms snoop
 TCP trace NFS Client   1.5  45 ms tcpdump on LINUX and snoop on Open Solaris
 Oracle  58 ms  47 ms “db file sequential read” wait (which is basically a timing of “pread” for 8k random reads specifically

 

In the above examples there was nothing outstanding in the vmstat results on the NFS client and the NFS server.

In both cases the problem arose with high level of swingbench sessions. The latency disparity appeared above a user load of 16. The latency disparity increased as user load increased.

It turned out that in the case where the NFS server was introducing the disparity that the maximum NFS threads on the server was 16. The solution was to increase the NFS server thread max: sharectl set -p servers=512 nfs.

In the case where the LINUX client introduce the disparity it turns out that the maximum outstanding NFS requests was 16. The solution was to raise sunrpc.tcp_slot_table_entries = 128.

The tcpdump.pl script didn’t find the answers but it did identify which side of the connection was at fault and it would identify if the network itself was the problem.

One thing to keep in mind is that snoop and tcpdump can be resource intensive and may introduce a major amount of latency. In the above cases with and without tcp tracing enabled showed no more than 1ms of difference and the latency being investigated was an order of magnitude greater. In other cases where there has been more I/O between the NFS server and client and/or the transfer sizes were higher (like 1M) then the impact of TCP tracing has eclipsed the latency disparity being investigated and  tcp tracing would not have been appropriate investigation path.



Other information

  • wireshark
    • display filters
      • http://wiki.wireshark.org/DisplayFilters
      • http://www.wiresharktraining.com/tips.html – graphing retrans
    • expressions
      • http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname=/usr/share/catman/man4/ethereal-filter.4.html
      • http://www.wireshark.org/docs/dfref/r/rpc.html
    • tcpdump
      • http://www.alexonlinux.com/tcpdump-for-dummies
    • snoop
      • http://www.cs.bgu.ac.il/~arik/usail/man/solaris/snoop.1.html
      • http://www.softpanorama.org/Net/Sniffers/snoop.shtml
    $ sudo snoop -D
    Using device aggr0 (promiscuous mode)
    kyle -> 172.16.100.92 drops: 0 TCP D=50461 S=22 Push Ack=2424998673 Seq=3990657488 Len=84 Win=49640
    172.16.100.92 -> kyle         drops: 0 TCP D=22 S=50461 Ack=3990657572 Seq=2424998673 Len=0 Win=254
    172.16.100.147 -> (broadcast)  drops: 0 ARP C Who is 172.16.100.115, 172.16.100.115 ?
    172.16.100.68 -> (broadcast)  drops: 0 ARP C Who is 172.16.100.115, 172.16.100.115 ?
    172.16.100.68 -> (broadcast)  drops: 0 ARP C Who is 172.16.100.184, 172.16.100.184 ?
    172.16.100.199 -> (broadcast)  drops: 0 ARP C Who is 172.16.100.200, 172.16.100.200 ?


Uncategorized

  1. Trackbacks

  2. No trackbacks yet.
  1. Comments

  2. mimmus
    July 5th, 2013 at 14:47 | #1

    I’m trying to use this script to understand why simply “dd”-ing a 1 GB file lasts more than 60 seconds. If write latency match on server and client and it’s about 12-13 ms, where can be the problem???

  3. July 7th, 2013 at 23:38 | #2

    @mimmus: what exactly is taking 12-13ms? Is that the total transfer time at the TCP level or is that the transfer time of NFS request? and if it’s the tranfer time of one NFS request then the question is, what is the size of the NFS request.
    what is the syntax of the dd command you are running? The important is the block size, ie “bs=” part. Also are you doing a dd that is writing to the NFS server or reading from the NFS server.
    A dd 1 GB of a 1 GbE should be around 100 MB/s thus around 10 secs, but if I change the bs size to something small, I can end up slowing the transfer time. The NFS I/O request size can be up to 1MB if the mount options are set the high and the OS supports it (HP and LINUX support 1MB, Solaris does too but you have to change a kernel parameter and AIX only supports 64K unless you use a patch to increase it)
    If you just want to see what the throughput is across the wire then it might be better to use a tool like netio or netsurf which will test the throughput at different I/O transfer sizes over TCP.
    In the above example I know the NFS request size is one database datablock size which is 8K

  4. mimmus
    July 10th, 2013 at 08:37 | #3

    Hi, the test command is:
    dd if=/dev/zero of=mmm bs=1024 count=1000000

    The transfer occurs on a 10 Gbps network but it lasts +60 secs!!! On another server with same setup, it lasts 4-5 secs!
    Mount options are:
    rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,vers=3
    (Oracle suggests these).

    I tested network using iperf and I can achieve almost 10 Gbps full, the bottleneck is elsewhere.

  5. July 11th, 2013 at 23:11 | #4

    @mimmus:
    Have you tried using netio, netsurf or iperf instead of dd?
    Have you tried bs=32768?
    A bs=1024 is small.
    if “mmm” is a location on an NFS server and
    NFS transfers are going to take around 200us on the fast side.
    and if a transfer is 1024 that is 1K then that’s 5k/ms or 5M/sec 300MB/min which is even slower than you are seeing.
    In this case I would guess there is some aggregation going on already to hit 1GB/m
    If your NFS exchanges are 12ms then the transfer sizes have to be around 200K
    So in that case why not increase your bs explicitly?

    For transfer rates it depends on the layers :
    program, dd in this case
    os, nfs client
    tcp nfs client
    nic, nfs client
    network
    nic nfs server
    tcp nfs server
    os nfs server
    nfs on nfs server
    I/O nfs server
    you want to see which layer the time is being spent

    you are doing a dd write from nfs client to nfs server so first big question is how fast is the I/O write speed on the NFS server.
    The script on this blog post isolates latency at the nfs client tcp layer to the nfs server tcp layer
    if both those layers are slow then the problem is on the nfs server at either the nfs server os, nfs code path or I/O
    The script on this blog post doesn’t look at transfer sizes. It should for cases like this. In my case I knew the transfer sizes were 8K. In your case the transfer size has to be bigger than 1K if you are seeing 1GB/min with 12-13ms NFS exchanges.

  6. mimmus
    July 16th, 2013 at 07:17 | #5

    I understand.
    I think that cache on the NFS server (an EMC filer) distorts these simplistic tests.

  7. July 16th, 2013 at 18:13 | #6

    Hi Mimmus, check out a previous blog post http://dboptimizer.com/2011/07/19/nfs-throughput-testing-trick-solaris/ on using dd with ramdisk and/or memory mapped files to cut out the file system. For dd and caching, on Linux you can use flags like

    write
    dd if=/dev/zero of=toto bs=8k count=100 oflag=direct
    read
    dd of=/dev/null if=toto bs=8k count=100 iflag=direct

You must be logged in to post a comment.