
Currently I need to do some throughput testing. My hardware setup is a Samsung 950 Pro connected to an NVMe controller that is hooked to the motherboard via a PCIe port. Linux exposes a corresponding nvme device, which I have mounted at a location on the filesystem.

My hope was to use Python to do this. I was planning on opening a file on the filesystem where the SSD is mounted, recording the time, writing a stream of n bytes to the file, recording the time again, then closing the file using the os module's file-operation utilities. Here is the function to gauge write throughput:

import os
import time

def perform_timed_write(num_bytes, blocksize, fd):
    """
    Write to a file and record the elapsed time.

    The function has three steps: write the data, record the time
    taken, and calculate the transfer rate.

    Parameters
    ----------
    num_bytes: int
        total number of bytes to write to the file
    blocksize: int
        number of bytes to write per os.write() call
    fd: string
        path on the filesystem to write to

    Returns
    -------
    bytes_per_second: float
        rate of transfer
    """
    # generate a random byte string to write repeatedly
    random_byte_string = os.urandom(blocksize)

    # open the file
    write_file = os.open(fd, os.O_CREAT | os.O_WRONLY | os.O_NONBLOCK)

    # record time, write, record time again
    bytes_written = 0
    before_write = time.perf_counter()
    while bytes_written < num_bytes:
        # os.write() may perform a short write, so count what it
        # actually reports rather than assuming blocksize
        bytes_written += os.write(write_file, random_byte_string)
    after_write = time.perf_counter()

    # close the file
    os.close(write_file)

    # calculate elapsed time
    elapsed_time = after_write - before_write

    # calculate bytes per second
    bytes_per_second = num_bytes / elapsed_time

    return bytes_per_second
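
For completeness, I invoke it roughly like this (the path is a placeholder; the sizes mirror the 1 GiB total and 4k block size of the fio job further down):

rate = perform_timed_write(num_bytes=1024**3,
                           blocksize=4096,
                           fd="/fsmnt/fs1/pywrite.tmp")
print("{:.1f} MiB/s".format(rate / 1024**2))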

My other method of testing is to use the Linux fio utility: https://linux.die.net/man/1/fio

After mounting the SSD at /fsmnt/fs1, I used this job file to test the throughput:

;Write to 1 file on partition
[global]
ioengine=libaio
buffered=0
rw=write
bs=4k
size=1g
openfiles=1

[file1]
directory=/fsmnt/fs1

I noticed that the write speed returned by the Python function is significantly higher than that reported by fio. Because Python is so high-level, there is a lot of control you give up, and I am wondering if Python is doing something under the hood to inflate its speeds. Does anyone know why Python would report write speeds so much higher than those reported by fio?

John Frye
  • `os.write()` returns the number of bytes written, which you should add to `bytes_written` on each loop iteration. It's possible there are short writes. What is the relative speed of the Python test to the fio test? – bnaecker Jan 17 '18 at 22:36
  • Python:fio speeds are about 4:1. Also, your suggestion is noted; I updated the software to do that, because you have no guarantee that the write will return blocksize (it could fail and return 0). Having added it, though, the results look the same. – John Frye Jan 18 '18 at 15:33
  • I'm also suspicious that the non-blocking I/O option is changing the meaning of the timing. Why are you doing that? If you're interested in determining the actual I/O speeds of the device, you'd really want to have the write call return only after the I/O has truly completed. As it stands, you're probably measuring something more like the time the non-blocking write syscall takes, which the FIO tool might be smart enough to compensate for. Just a guess. – bnaecker Jan 18 '18 at 16:28
  • Perhaps, but the results were more or less identical with the first implementation, which used the classic with open() file handles. I did this because I need to write to multiple partitions at once, which I believe I accomplished with multithreading, although the results suggest otherwise; but that is another problem in and of itself. – John Frye Jan 18 '18 at 17:03

2 Answers


Your Python program does better than your fio job because this is not a fair comparison: they are testing different things.

  • You banned fio from using Linux's buffer cache (buffered=0 is the same as saying direct=1) by telling it to do O_DIRECT operations. With the job you specified, fio has to send down a single 4k write, then wait for that write to complete at the device (and the acknowledgement has to get all the way back to fio) before it can send the next.

  • Your Python script sends down writes that can be buffered in the kernel's page cache before they ever touch your SSD. This generally means the writes are accumulated and merged together before being sent down to the device, resulting in chunkier I/Os that have less overhead. Further, since you never explicitly flush, in theory no I/O has to be sent to the disk before your program exits (in practice this will depend on a number of factors, such as how much I/O you do, the amount of RAM Linux can set aside for buffers, the maximum time the filesystem will hold dirty data, how long you do the I/O for, etc.)! Note that os.write() and os.close() are thin wrappers over the write(2) and close(2) syscalls (there is no C-library stdio buffering in play here), and the close(2) man page warns:

    A successful close() does not guarantee that the data has been successfully saved to disk, as the kernel uses the buffer cache to defer writes.

    In fact you take your final time before calling os.close(), so you may even be omitting the time it took for the final "batches" of data to reach the kernel's cache, let alone the SSD! If you want the flush accounted for, fsync before stopping the clock; see the sketch after this list.
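
A minimal sketch of a fairer timing loop (the function name and path handling are mine, not from the question): os.fsync() forces the dirty pages out to the device before the clock stops, so the measurement includes the actual transfer.

import os
import time

def timed_write_with_flush(path, num_bytes, blocksize):
    """Variant of the question's loop that flushes before timing stops."""
    data = os.urandom(blocksize)
    fd = os.open(path, os.O_CREAT | os.O_WRONLY)
    written = 0
    start = time.perf_counter()
    while written < num_bytes:
        # count what write(2) actually reports, in case of short writes
        written += os.write(fd, data)
    os.fsync(fd)               # push dirty pages from the page cache to the SSD
    end = time.perf_counter()  # only now stop the clock
    os.close(fd)
    return num_bytes / (end - start)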

Your Python script is closer to this fio job:

[global]
ioengine=psync
rw=write
bs=4k
size=1g

[file1]
filename=/fsmnt/fio.tmp

Even this is not an exact match, but unlike your original job it at least goes through the kernel's buffer cache, as your Python program does.

The key takeaway is that your Python program is not really testing your SSD's speed at your specified block size, and your original fio job is a bit weird and heavily restricted (the libaio ioengine is asynchronous, but with a depth of 1 you're not going to be able to benefit from that, and that's before we get to the behaviour of Linux AIO when using filesystems), so it does different things to your Python program. If you're not doing significantly more buffered I/O than the largest buffer can absorb (and on Linux the kernel's buffer cache scales with RAM), and if the buffered I/Os are small, the exercise turns into a demonstration of the effectiveness of buffering.
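
Conversely, if you want Python itself to bypass the page cache the way your original fio job does, Linux lets you open the file with O_DIRECT. This is only a sketch under assumptions (4096-byte logical blocks, num_bytes a multiple of blocksize): O_DIRECT requires the buffer, transfer size, and file offset to be suitably aligned, and an anonymous mmap conveniently gives you page-aligned memory.

import mmap
import os
import time

def timed_direct_write(path, num_bytes, blocksize=4096):
    """Hypothetical O_DIRECT variant; alignment rules vary by device/filesystem."""
    buf = mmap.mmap(-1, blocksize)       # page-aligned buffer, as O_DIRECT needs
    buf.write(os.urandom(blocksize))
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_DIRECT)
    written = 0
    start = time.perf_counter()
    while written < num_bytes:           # num_bytes must be a blocksize multiple
        written += os.write(fd, buf)
    end = time.perf_counter()
    os.close(fd)
    return num_bytes / (end - start)

With the page cache out of the picture, the numbers should land much closer to what your original fio job reports.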

Anon

If you need the exact performance of the NVMe device, fio is the best choice. fio can write test data to the device directly, without any filesystem in the way. Here is an example:

[global]
ioengine=libaio
invalidate=1
iodepth=32
time_based
direct=1
filename=/dev/nvme0n1

[write-nvme]
stonewall
bs=128K
rw=write
numjobs=1
runtime=10000

SPDK is another choice. There is an existing example of a performance test at https://github.com/spdk/spdk/tree/master/examples/nvme/perf.

Pynvme, which is based on SPDK, is a Python extension. You can write performance tests with its ioworker().

Crane Chu