The program below reads in a bunch of lines from a file and parses them. It could be faster. On the other hand, if I have several cores and several files to process, that shouldn't matter much; I can just run jobs in parallel.
Unfortunately, this doesn't seem to work on my Arch machine. Running two copies of the program is only slightly (if at all) faster than running one copy (see below), and reaches less than 20% of what my drive is capable of. On an Ubuntu machine with identical hardware, the situation is a bit better: I get linear scaling for 3-4 cores, but I still top out at about 50% of my SSD's read bandwidth.
What obstacles prevent linear scaling of I/O throughput as the number of cores increases, and what can be done to improve I/O concurrency on the software/OS side?
P.S. - For the hardware alluded to below, a single core is fast enough that reading would be I/O bound if I moved parsing to a separate thread. There are also other optimizations for improving single-core performance. For this question, however, I'd like to focus on the concurrency and how my coding and OS choices affect it.
Details:
Here are a few lines of iostat -x 1 output:
Copying a file to /dev/null with dd:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 883.00 0.00 113024.00 0.00 256.00 1.80 2.04 2.04 0.00 1.13 100.00
Running my program:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 1.00 1.00 141.00 2.00 18176.00 12.00 254.38 0.17 1.08 0.71 27.00 0.96 13.70
Running two instances of my program at once, reading different files:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 11.00 0.00 139.00 0.00 19200.00 0.00 276.26 1.16 8.16 8.16 0.00 6.96 96.70
It's barely better! Adding more cores doesn't increase throughput; in fact, it starts to degrade and become less consistent.
Here's one instance of my program and one instance of dd:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 9.00 0.00 468.00 0.00 61056.00 0.00 260.92 2.07 4.37 4.37 0.00 2.14 100.00
Here is my code:
#include <string>
#include <vector>
#include <boost/filesystem/path.hpp>
#include <boost/algorithm/string.hpp>
#include <boost/filesystem/operations.hpp>
#include <boost/filesystem/fstream.hpp>

typedef boost::filesystem::path path;
typedef boost::filesystem::ifstream ifstream;

int main(int argc, char ** argv) {
    path p{std::string(argv[1])};
    ifstream f(p);
    std::string line;
    std::vector<boost::iterator_range<std::string::iterator>> fields;
    // while (getline(...)) processes every line, including a final line with
    // no trailing newline, which the old `for (...; !f.eof(); ...)` loop dropped.
    while (getline(f, line)) {
        boost::split(fields, line, boost::is_any_of(","));
    }
    f.close();
    return 0;
}
Here's how I compiled it:
g++ -std=c++14 -o gah.o -c gah.cxx
g++ -std=c++14 -o gah gah.o -lboost_filesystem -lboost_system
Edit: Even more details
I clear the memory cache (free page cache, dentries, and inodes, i.e. echo 3 > /proc/sys/vm/drop_caches after a sync) before running the above benchmarks, to keep Linux from serving pages from cache.
My process appears to be CPU-bound; switching to mmap or changing the buffer size via pubsetbuf has no noticeable effect on recorded throughput.
On the other hand, scaling is I/O-bound. If I bring all files into the memory cache before running my program, throughput (now measured via execution time, since iostat can't see it) scales linearly with the number of cores.
What I'm really trying to understand is: when I read from disk using multiple sequential-read processes, why doesn't throughput scale linearly with the number of processes, up to something close to the drive's maximum read speed? Why do I hit an I/O bound without saturating the drive, and how does the point at which I hit it depend on the OS and software stack I'm running on?