How can I change the size of each I/O read and write operation performed by Python 2.7?

I'm trying to use AWS EBS HDD storage, which caps throughput by limiting both the number of I/O operations per second and the size of each operation. To quote from the AWS volume type specs:

** gp2/io1 based on 16 KiB I/O size, st1/sc1 based on 1 MiB I/O size

Running iostat -xmdtz 1 on my machine, the typical output is this:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme1n1           0.00     0.00 1435.00    0.00   179.12     0.00   255.64     1.77    1.22    1.22    0.00   0.69  99.60
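
As far as I understand, iostat reports avgrq-sz in 512-byte sectors, so the per-request size here works out to roughly:

# rough arithmetic on the numbers above
print(255.64 * 512 / 1024.0)    # avgrq-sz converted to KiB: ~127.8
print(179.12 * 1024 / 1435.0)   # rMB/s divided by r/s, also ~127.8 KiB per read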

So it looks like the I/O size Python ends up driving is roughly 128 KiB per request. My question is:

How can I change that to 1 MiB, to realize the full bandwidth potential offered by AWS?

Though I think the I/O operation size in Python is determined by some lower-level module (io?), for what it's worth the relevant part of the code reads as follows. x is a memory-mapped numpy array, loaded like so:

x = np.load("...", mmap_mode = 'r')

and then the part of the code that actually reads it is the last line in this code snippet:

shared_x_base = multiprocessing.Array(ctypes.c_uint32, n1*k, lock=False)
shared_x = np.ctypeslib.as_array(shared_x_base)
shared_x = shared_x.reshape(n1, k)
shared_x[:] = x[:]
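
One variant would be to skip the memmap and fill the shared array with explicit ~1 MiB read() calls instead; a rough sketch (the .npy header offset is left as a placeholder, and I don't know whether the block layer would actually pass 1 MiB requests through to the device):

import numpy as np

CHUNK = 1024 * 1024  # ask for ~1 MiB per read() call

def fill_from_raw(dest, path, offset):
    # dest: the shared uint32 array; offset: length of the .npy header (placeholder)
    byte_view = dest.reshape(-1).view(np.uint8)  # flat byte view of the destination
    total = byte_view.nbytes
    with open(path, "rb") as f:
        f.seek(offset)
        pos = 0
        while pos < total:
            data = f.read(min(CHUNK, total - pos))
            if not data:
                break
            byte_view[pos:pos + len(data)] = np.frombuffer(data, dtype=np.uint8)
            pos += len(data)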

EDIT: For writing, there's an initial surge in size (and bandwidth) which looks like this:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme1n1           0.00     0.00   29.00 2033.00     3.62   507.84   507.99    59.37   28.83   33.93   28.76   0.48 100.00

but then it settles down to this:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme1n1           0.00     0.00 1673.00    0.00   207.12     0.00   253.55     1.78    1.06    1.06    0.00   0.59  98.80

EDIT: I've also tried removing the memmapping and just using np.load and np.save (this answer suggests this is the way to go), and either way I thought it would help clarify what the source of the problem is. The performance is even worse:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme1n1           0.00     0.00  589.00    0.00    73.62     0.00   256.00     1.88    3.19    3.19    0.00   1.68  99.20
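
(For clarity, the non-memmapped variant is just the same copy with the mmap_mode argument dropped:)

x = np.load("...")    # no mmap_mode='r': numpy reads the whole file into memory first
shared_x[:] = x[:]    # the copy into the shared array is unchanged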

Since I'm less certain the problem is really with Python's I/O operation size (see Martijn Pieters' very helpful answer), I would like to ask more generally:

How can I tune the system parameters to make np.load() and np.save() operations (with or without memmapping) work at the maximal bandwidth possible under the AWS throttling policy?

Just Me

1 Answer

You are opening the array as a memory-mapped object, which uses the mmap module under the hood. That ultimately uses the mmap system call, and it is not further configurable.

Instead, the I/O block size for mmapped files is controlled by the kernel; you can discover it via the mmap.PAGESIZE value, or on the command line with getconf PAGESIZE.
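
For example (the value is platform-dependent; 4096 bytes is typical on x86 Linux):

import mmap
print(mmap.PAGESIZE)               # typically 4096 bytes
print(mmap.ALLOCATIONGRANULARITY)  # equal to PAGESIZE on Unix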

You can probably tune this size by making sure transparent hugepages support is enabled in the kernel you are running.
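
On most Linux kernels you can check the current transparent hugepage setting via sysfs; a sketch, using the usual mainline path (it may differ per distribution):

# the value in brackets is the active mode, e.g. "always [madvise] never"
with open("/sys/kernel/mm/transparent_hugepage/enabled") as f:
    print(f.read().strip())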

However, iostat statistics are heavily affected by the kernel I/O cache tuning parameters. From the iostat manpage:

The iostat command generates reports that can be used to change system configuration to better balance the input/output load between physical disks.

That first 'burst' you see is because iostat gives you overall system stats from the time the system was booted:

The first report generated by the iostat command provides statistics concerning the time since the system was booted. Each subsequent report covers the time since the previous report.

Don't interpret those numbers as being caused by your Python code.

If you want to tune the kernel I/O cache, see Performance Tuning on Linux - Disk I/O, but take into account that AWS probably already has tuned this appropriately for network-connected storage.
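
For what it's worth, the block layer's own request-size limits for the device can also be inspected under /sys/block/<device>/queue; a sketch, using the device name from the iostat output above:

# standard Linux sysfs paths; adjust the device name as needed
import os

dev = "nvme1n1"
for name in ("max_sectors_kb", "max_hw_sectors_kb", "read_ahead_kb", "nr_requests"):
    with open(os.path.join("/sys/block", dev, "queue", name)) as f:
        print("%s = %s" % (name, f.read().strip()))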

Martijn Pieters
  • `ubuntu@ip-10-140-38-8:/sortedVolume$ getconf PAGESIZE 4096` presumably, that means 4096 *KB*? It doesn't seem to agree with what I get from `iostat`. Also, I noticed that initially the writing occurs at larger blocks (`avgrq-sz` column) of 508, but then it goes back down to 254 or so. Any ideas on how to reconcile these numbers? – Just Me Apr 17 '19 at 12:37
  • @JustMe: no, bytes, so 4KB, memory pages are relatively small. I think you are looking at the kernel I/O cache layer here. – Martijn Pieters Apr 17 '19 at 13:25
  • @JustMe: I've confirmed that iostat measures the kernel-to-blockdevice performance metrics. If you want to tune that, tune the kernel, not Python. – Martijn Pieters Apr 17 '19 at 13:37
  • Thanks for all the tips. I've removed the mem-map dependency (or at least, the explicit mem-map dependency). The performance seems even worse, though the `iostat` reported size is the same (see the latest edit to my question). I'm stumped. Btw, it seems none of the devices (including the one I'm working with) has an "elevator" configured: `$ cat /sys/block/*/queue/scheduler none none ...` – Just Me Apr 18 '19 at 09:36
  • Oh, one last thing: the writing `iostat` reports I included were the periodic "real time" ones generated after the initial report, there really was a surge followed by a decline. – Just Me Apr 18 '19 at 09:46