Bursty NFS Transfers
I noticed some odd behavior the other day when transferring a large number of files to my NFS server. Watching the transfer, the file copies happened in large 'bursts': reads from the disk would occur, then stop, followed by long transfers on the network (saturating the gigabit uplink).
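It turns out I had been bitten by write caching, specifically the writeback behavior of Linux. Linux has several tunable knobs under /proc/sys/vm related to writeback of dirty pages from the FS layer back out to disk. The ones that matter here are dirty_ratio, dirty_bytes, dirty_background_ratio, dirty_background_bytes, dirty_writeback_centisecs and dirty_expire_centisecs.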
Of the flags above, the bytes and ratio variants let you specify the buffer size either as a percentage of system memory (the ratio version) or as an exact amount (the bytes version). Writing to either one of these files causes the other to return 0.
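To see which form is in effect on a given machine, the knobs can simply be read back. The following is a minimal Python sketch, not a polished tool: the helper names read_knobs and active_form are my own, and the base directory is a parameter only so the function can be pointed at a scratch directory for testing.

```python
from pathlib import Path

# The four size-related writeback knobs: a ratio and a bytes form
# for both the background and the blocking limit.
SIZE_KNOBS = (
    "dirty_background_ratio",
    "dirty_background_bytes",
    "dirty_ratio",
    "dirty_bytes",
)


def read_knobs(base="/proc/sys/vm", names=SIZE_KNOBS):
    """Return {name: int} for every knob file that exists under base."""
    values = {}
    for name in names:
        path = Path(base) / name
        if path.is_file():
            values[name] = int(path.read_text().strip())
    return values


def active_form(ratio, nbytes):
    """Report which form of a limit is in effect; the unused one reads 0."""
    return "bytes" if nbytes != 0 else "ratio"


if __name__ == "__main__":
    # Prints nothing on systems without /proc/sys/vm.
    for name, value in read_knobs().items():
        print(f"{name} = {value}")
```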
The difference between the 'background' and non-background versions is a bit more subtle. Linux will start writing data back out to the backing store once the pages 'expire' OR once the background threshold has been reached. If dirty data keeps accumulating faster than it can be written out, it builds up until it hits the non-background threshold, at which point the writer blocks while Linux catches up.
The other 2 flags to consider are dirty_expire_centisecs and dirty_writeback_centisecs. dirty_writeback_centisecs controls how often the kernel flusher wakes up to see if data needs to be written, and dirty_expire_centisecs is how old data must be before becoming a candidate for write out. On my system the kernel wakes up every 5 seconds to write out data older than 30 seconds.
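The kernel stores these two intervals in hundredths of a second, in dirty_writeback_centisecs and dirty_expire_centisecs, so the raw numbers need converting before they read naturally. A small sketch of that conversion follows. The values 500 and 3000 are assumed here because they match the 5 and 30 seconds I see, and the function names are just for illustration.

```python
def centisecs_to_seconds(value):
    """Convert a *_centisecs knob value (hundredths of a second) to seconds."""
    return int(value) / 100


def describe_flusher(writeback_centisecs, expire_centisecs):
    """Summarise the flusher wake-up interval and the expiry age."""
    wake = centisecs_to_seconds(writeback_centisecs)
    age = centisecs_to_seconds(expire_centisecs)
    return f"flusher wakes every {wake:g}s and writes data older than {age:g}s"


# Assumed values: 500 and 3000 centiseconds.
print(describe_flusher(500, 3000))
# prints: flusher wakes every 5s and writes data older than 30s
```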
Now that a bit of background is out of the way, let's dive into the main reason I had issues. The machine I was using for the transfer has a modest amount of memory for a modern system (16GB), and dirty_ratio and dirty_background_ratio were set to 20 and 10, meaning at the 10% threshold Linux would start writeback and at 20% it would pause the writer. With 16GB of RAM this corresponds to roughly 1.6GB to start flushing and 3.2GB to block the writer.
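The arithmetic is simple enough to sketch in a few lines of Python. Treat it as an approximation: it assumes the ratios apply to the full 16GB, whereas the kernel works from available (free plus reclaimable) memory, so the real thresholds come out somewhat lower. The function name is hypothetical.

```python
GB = 10**9


def dirty_thresholds(memory_bytes, background_ratio, ratio):
    """Return (background, blocking) thresholds in bytes for the given ratios."""
    background = memory_bytes * background_ratio // 100
    blocking = memory_bytes * ratio // 100
    return background, blocking


background, blocking = dirty_thresholds(16 * GB, 10, 20)
print(f"background writeback starts at {background / GB:.1f} GB")  # 1.6 GB
print(f"writers block at {blocking / GB:.1f} GB")                  # 3.2 GB
```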
I was maxing out the disks at 100MB/s and would frequently see 1.6GB of data read before data started to be written out. However, as this roughly corresponded to the file size, I rarely saw reads and writes overlap, and everything appeared to be flushed out by the 'cp' command before it started the next file.
Normally it would be a good idea to bunch up writes, in case data gets overwritten and to take advantage of adjacent data, but in this case the pauses were hurting transfer performance. I had 2 options: reduce the buffer size to something saner, or reduce the expire time. In this particular instance I chose to reduce the buffer size to 100MB for background writeout and 200MB for blocking, the latter corresponding to about 2 seconds at full disk transfer speed. There seemed to be little to no benefit to having a buffer that large, and there are potential issues in the event of a power loss.
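For the curious, here is roughly what that change looks like as a Python sketch. The helper names are made up, MB is taken as 10^6 bytes (an assumption, I did not record the exact byte values), and the base directory is a parameter so the write can be tried against a scratch directory first. Against the real /proc/sys/vm it needs root, and the setting does not survive a reboot.

```python
from pathlib import Path

MB = 10**6  # assumption: decimal megabytes


def seconds_of_buffer(limit_bytes, disk_bytes_per_sec):
    """How long the disk needs to drain a full buffer of limit_bytes."""
    return limit_bytes / disk_bytes_per_sec


def apply_limits(background_bytes, blocking_bytes, base="/proc/sys/vm"):
    """Write both byte limits; the matching ratio files then read back as 0."""
    (Path(base) / "dirty_background_bytes").write_text(f"{background_bytes}\n")
    (Path(base) / "dirty_bytes").write_text(f"{blocking_bytes}\n")


# 200MB blocking limit drained at 100MB/s.
print(seconds_of_buffer(200 * MB, 100 * MB))  # 2.0
```

Run as root, apply_limits(100 * MB, 200 * MB) makes the change on the live system. To keep it across reboots the same values can be set as vm.dirty_background_bytes and vm.dirty_bytes in the sysctl configuration.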
Why write such a long and wordy article on something so simple? When checking NFS tuning guides I found a distinct lack of mention of this tunable and its effect on NFS write performance. This is not that surprising, as the authoritative guide for NFS tuning in Linux is from the Linux Documentation Project and many NFS tuning articles are nothing more than copy and pastes from it. The defaults that made sense when that article was written do not necessarily make sense today, and as a warning I would not copy those settings verbatim.
I did manage to find 2 types of references to tuning this flag: one as a Novell knowledge base support note, and one as part of tuning NFS for use with VMware ESX. These are not sources I would typically turn to for NFS tuning information, however the information they provided turned out to be spot on.
Lastly, I thought it was important to write this down as I had long assumed that /proc/sys/vm was for tuning the allocation of kernel memory and caching for block devices. The idea that it may affect a network FS never occurred to me and was something I glossed over. Only on a whim did I go off and change these settings, and I found the issue the first time I looked. For more information take a look here
As a side note, dstat was an incredibly useful tool for diagnosing this behavior. By spitting out a line of system stats once per second, it made it easy to see this bursting behavior and time it (by counting off the number of lines). If you have not used this tool before, I would highly recommend giving it a go before you reach for top or htop to get a feel for your system.
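dstat shows far more than this, but if it is not installed, a rough stand-in for the part that matters here is to watch the Dirty and Writeback counters in /proc/meminfo once per second. The sketch below is my own and is not how dstat works internally. The function names are hypothetical.

```python
import time


def parse_meminfo(text, fields=("Dirty", "Writeback")):
    """Extract the requested fields (in kB) from /proc/meminfo-style text."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in fields:
            values[name] = int(rest.split()[0])
    return values


def watch(samples=5, interval=1.0, path="/proc/meminfo"):
    """Print one line per interval, in the spirit of dstat's one-line output."""
    for _ in range(samples):
        with open(path) as f:
            info = parse_meminfo(f.read())
        print(f"Dirty: {info.get('Dirty', 0):>10} kB  "
              f"Writeback: {info.get('Writeback', 0):>10} kB")
        time.sleep(interval)
```

Calling watch() during a large copy should show Dirty climbing towards the background threshold and then dropping as the flush kicks in.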
So there you have it: if you are seeing bursty NFS writes from an NFS client and that client has a tonne of RAM in it, then perhaps you need to tune the system to have a smaller cache rather than a larger one.
Until my next folly.