The program below reads in a bunch of lines from a file and parses them. It could be faster. On the other hand, if I have several cores and several files to process, that shouldn't matter much; I can just run jobs in parallel.
Unfortunately, this doesn't seem to work on my Arch machine. Running two copies of the program is only slightly (if at all) faster than running one copy (see below), and reaches less than 20% of what my drive is capable of. On an Ubuntu machine with identical hardware, the situation is a bit better: I get linear scaling for 3-4 cores, but I still top out at about 50% of my SSD's read bandwidth.
What obstacles prevent linear scaling of I/O throughput as the number of cores increases, and what can be done to improve I/O concurrency on the software/OS side?
P.S. - For the hardware alluded to below, a single core is fast enough that reading would be I/O bound if I moved parsing to a separate thread. There are also other optimizations for improving single-core performance. For this question, however, I'd like to focus on the concurrency and how my coding and OS choices affect it.
Details:
Here are a few lines of iostat -x 1 output:
Copying a file to /dev/null with dd:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 883.00 0.00 113024.00 0.00 256.00 1.80 2.04 2.04 0.00 1.13 100.00
Running my program:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 1.00 1.00 141.00 2.00 18176.00 12.00 254.38 0.17 1.08 0.71 27.00 0.96 13.70
Running two instances of my program at once, reading different files:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 11.00 0.00 139.00 0.00 19200.00 0.00 276.26 1.16 8.16 8.16 0.00 6.96 96.70
It's barely better! Adding more cores doesn't increase throughput; in fact, it starts to degrade and become less consistent.
Here's one instance of my program and one instance of dd:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 9.00 0.00 468.00 0.00 61056.00 0.00 260.92 2.07 4.37 4.37 0.00 2.14 100.00
Here is my code:
#include <string>
#include <vector>
#include <boost/filesystem/path.hpp>
#include <boost/algorithm/string.hpp>
#include <boost/filesystem/operations.hpp>
#include <boost/filesystem/fstream.hpp>

typedef boost::filesystem::path path;
typedef boost::filesystem::ifstream ifstream;

int main(int argc, char ** argv) {
    path p{std::string(argv[1])};
    ifstream f(p);
    std::string line;
    std::vector<boost::iterator_range<std::string::iterator>> fields;
    // while (getline(...)) processes every line, including a final line with
    // no trailing newline, which the old `for (...; !f.eof(); ...)` loop dropped.
    while (getline(f, line)) {
        boost::split(fields, line, boost::is_any_of(","));
    }
    f.close();
    return 0;
}
Here's how I compiled it:
g++ -std=c++14 -o gah.o -c gah.cxx
g++ -std=c++14 -o gah gah.o -lboost_filesystem -lboost_system
Edit: Even more details
I clear the memory cache (free page cache, dentries, and inodes, i.e. echo 3 > /proc/sys/vm/drop_caches after a sync) before running the above benchmarks, to keep Linux from serving pages from cache.
My process appears to be CPU-bound; switching to mmap or changing the buffer size via pubsetbuf has no noticeable effect on recorded throughput.
On the other hand, scaling is I/O-bound. If I bring all files into the memory cache before running my program, throughput (now measured via execution time, since iostat can't see it) scales linearly with the number of cores.
What I'm really trying to understand is: when I read from disk using multiple sequential-read processes, why doesn't throughput scale linearly with the number of processes, up to something close to the drive's maximum read speed? Why do I hit an I/O bound without saturating the drive, and how does the point at which I hit it depend on the OS and software stack I'm running on?