I want to measure and plot how an SSD's latency percentiles change over time. If anyone has done something similar, please share any advice you might have. I am interested both in how to run FIO and in how to process the results.
I will first describe the testing methodology I want to use, then describe what I have done so far (which works imperfectly), and finally ask a couple of questions.
Goal:
I want to keep track of the average latency and the 95th, 99th, and 99.9th latency percentiles over time. These measures are implicitly defined over a time window, which I would like to be able to set to something like 10-60 s intervals.
I want to compare how these latency percentiles change as I vary the IO pattern at a constant device load. I need to be able to control the total load (the amount of data sent to the device) to make sure that the percentiles are actually comparable. A simple example: a) a single thread that writes sequentially at 200 MB/s vs. b) two threads that each write at 100 MB/s. It would be meaningless to compare percentiles if the total throughput of the two experiments were different.
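For the constant-load comparison, fio's `rate` option can cap per-job throughput so that both experiments push the same total load. A minimal sketch of experiment (b) as a job file (the device path is a placeholder, and percentage support for `offset`/`size` depends on your fio version; for experiment (a) you would run a single job with `rate=200m` over the whole device):

```ini
; (b) two jobs, each capped at 100 MB/s -> 200 MB/s total
[global]
rw=write
bs=128k
direct=1
time_based=1
runtime=3600
rate=100m               ; per-job throughput cap

[writer1]
filename=/dev/nvme0n1   ; assumption: replace with your device
offset=0
size=50%

[writer2]
filename=/dev/nvme0n1
offset=50%
size=50%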
What I tried so far:
A custom build of FIO with increased latency-histogram resolution. This is probably not needed.
I turned on json+ output so that I get the nice latency histograms. However, these histograms aggregate over the whole FIO run, so I have no way to measure how latency changes over time.
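One way around the whole-run aggregation: recent fio versions can periodically emit the completion-latency histogram to a log file via `write_hist_log` plus `log_hist_msec`, which gives you one histogram sample per window without restarting fio (check your fio version's HOWTO for availability). A sketch, with the device path as a placeholder:

```ini
[global]
rw=write
bs=4k
direct=1
time_based=1
runtime=3600

[job1]
filename=/dev/nvme0n1   ; assumption: your device
write_hist_log=hist     ; emits hist_clat_hist.*.log files
log_hist_msec=30000     ; one histogram sample every 30 s
```

The fio repo ships `tools/hist/fiologparser_hist.py`, which is meant to turn these histogram logs into percentiles.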
To get the latency change over time, I thought of starting many small FIO jobs one after another. For example, to cover 1 h, I would start 120 FIO runs of 30 s each and save each output to a different file. Each output would then give me the latency percentiles over a 30 s window. However, there are two problems with this approach:
FIO startup takes a long time (about 15-20 s), and this gap allows the SSD to perform GC and recover its write performance between runs.
For sequential writes, the write offset is reset at the start of each FIO job. This means a new run does not actually continue writing sequentially where the previous one stopped and, even worse, some portions of the device might never be written at all.
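Both problems go away if you keep a single long-running fio process and log every I/O completion, then compute per-window percentiles offline. fio's `write_lat_log` with `log_avg_msec=0` records each completion with a timestamp. A sketch (device path is a placeholder):

```ini
[global]
rw=write
bs=128k
direct=1
time_based=1
runtime=3600
log_avg_msec=0          ; log every completion, no averaging

[writer]
filename=/dev/nvme0n1   ; assumption: your device
write_lat_log=lat       ; produces lat_lat.1.log, lat_clat.1.log, ...
```

Note that with no averaging these logs can grow very large at high IOPS; `log_hist_msec`-based histogram logging is the lighter-weight alternative.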
Questions:
Is there a way to use FIO to keep track of latency changes over time? If so, could you please provide an example?
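If you end up with a per-I/O latency log (e.g. from `write_lat_log` with `log_avg_msec=0`), the windowed percentiles are easy to compute offline. A sketch, assuming log lines of the form `time_ms, latency, direction, blocksize` (the latency unit depends on the fio version, nanoseconds in recent releases, microseconds in older ones, so treat it as opaque and label your plot axis accordingly):

```python
import csv
from collections import defaultdict

def windowed_percentiles(rows, window_s=30, pcts=(0.95, 0.99, 0.999)):
    """Group (time_ms, latency) samples into fixed windows and compute
    the average plus nearest-rank percentiles per window.

    rows: iterable of (time_ms, latency) pairs; the latency unit is
    whatever fio wrote to the log (ns or us depending on version)."""
    buckets = defaultdict(list)
    for time_ms, lat in rows:
        buckets[int(time_ms) // (window_s * 1000)].append(lat)
    out = []
    for w in sorted(buckets):
        lats = sorted(buckets[w])
        avg = sum(lats) / len(lats)
        # nearest-rank percentile lookup
        p = {q: lats[min(len(lats) - 1, int(q * len(lats)))] for q in pcts}
        out.append((w * window_s, avg, p))
    return out

def load_fio_lat_log(path):
    """Yield (time_ms, latency) from a fio lat log.

    Lines look like "time_ms, latency, ddir, blocksize[, ...]"; extra
    trailing columns in newer fio versions are ignored."""
    with open(path) as f:
        for row in csv.reader(f):
            yield int(row[0]), int(row[1])
```

The per-window tuples can then be fed straight into matplotlib (window start time on x, the percentile values on y).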
For sequential writes, how could I increase throughput? By default, FIO uses iodepth 1 (queue depth 1) for sequential writes, and I don't see a clear way of increasing throughput beyond that; increasing the iodepth does not seem to help.
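On the iodepth point: with fio's default synchronous engine (psync on Linux), `iodepth` greater than 1 has no effect, which is likely why increasing it did not help. An asynchronous engine such as libaio, combined with `direct=1`, lets the queue depth actually apply. A sketch (device path is a placeholder):

```ini
[seq-write]
filename=/dev/nvme0n1   ; assumption: your device
rw=write
bs=128k
ioengine=libaio         ; async engine, so iodepth takes effect
iodepth=16
direct=1                ; buffered sync writes would serialize anyway
```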
I saw there are some python scripts in the FIO git repo for plotting. Would any of these be useful? Could anyone point me to some example that resembles what I want to do?