
I have a big txt file with ~30 million rows, each row separated by a line separator \n, and I'd like to read all the lines into an unordered list (e.g. std::list<std::string>):

std::list<std::string> list;
std::ifstream file(path);
while(file.good())
{
    std::string tmp;
    std::getline(file, tmp);
    list.emplace_back(tmp);
}
process_data(list);

The current implementation is very slow, so I'm learning how to read the data in chunks.
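Something like this is what I mean by reading in chunks (just a sketch of the idea; the 1 MiB buffer size is an arbitrary guess, not a tuned value):

#include <fstream>
#include <vector>

std::ifstream file(path, std::ios::binary); // same 'path' as above
std::vector<char> chunk(1 << 20);           // 1 MiB per read; arbitrary guess

while (file)
{
    file.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
    const std::streamsize bytes_read = file.gcount();
    if (bytes_read <= 0)
        break;

    // split chunk[0 .. bytes_read) into lines here, carrying any
    // trailing partial line over into the next chunk
}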

But after seeing this comment:

parallelizing on a HDD will make things worse, with the impact depending on the distribution of the files on the HDD. On a SSD it might (!) improve things.

Is it bad to read a file in parallel? What's the algorithm to read all lines of a file into an unordered container (e.g. std::list, normal array, ...) as fast as possible, without using any libraries, and with cross-platform code?

  • Imagine how a HDD works. (Note that using `file.good()` as the loop condition is wrong.) – user202729 Feb 10 '18 at 10:58
  • The bottleneck is the disk, running multiple threads won't make the disk any faster. – Barmar Feb 10 '18 at 10:58
  • "*as fast as possible*": do not read line-oriented, but do a binary read using larger buffers? – alk Feb 10 '18 at 10:59
  • Also, all the threads are trying to write to the same list, contention will be a problem. – Barmar Feb 10 '18 at 11:00
  • @Barmar - Depends on the technology involved. Consider RAID or flash drives. Exadata? – Ed Heal Feb 10 '18 at 11:00
  • Do some simple maths: how fast are you reading data off the disk vs. the quoted average disk transfer rate? If you are getting anywhere near the quoted rate, that's it for performance. If you are not, you can try reading the file in big blocks (measure again), then add breaking the blocks into lines (measure again), then add the lines to a list. – Richard Critten Feb 10 '18 at 11:02
  • (side note: make sure that everything (your "large buffer") fits into your RAM. Otherwise you're killing performance) – user202729 Feb 10 '18 at 11:07
  • The OS and standard library are pretty good at optimising sequential read of a whole file, because it's such a common case. Have you measured time to read vs time to allocate all the memory for the list? – Alan Stokes Feb 10 '18 at 11:30
  • If you cared about performance, you wouldn't be using a `std::list`. – Veedrac Feb 10 '18 at 12:21
  • @HighPerformanceMark My question comes from a comment on that link. That comment was somewhat vague, so I made a new topic asking what it really meant. So this is not a duplicate. – MiP Feb 10 '18 at 13:41
  • @Veedrac Could you tell me what I should use instead, a `std::deque` or `std::vector`? After loading the file into this `std::list`, it will only be used for linear searching, no more inserting or removing. Also, what's wrong with the performance of using `std::string` to store a line of text? I used it because `std::getline` needs a `std::string` for the 2nd argument. – MiP Feb 10 '18 at 13:47
  • If you want something fast, I expect you'd do best with a single character array (any contiguous backing store is appropriate) and a `std::vector` to enumerate the lines. Though that's fairly redundant, so maybe a `std::vector` would work better, though it'd be trickier to use. – Veedrac Feb 10 '18 at 13:55
  • `std::list` is slow as molasses, and `std::string` is only really appropriate for small stringy temporaries (and even then, it mostly exists so programmers can get away with being lazy about memory). Reading files with `std::getline` is slow; use something chunked like in the question you linked. – Veedrac Feb 10 '18 at 13:56
  • If the bottleneck is in disk I/O, then trying to multithread in the hope of making disk I/O itself go faster is generally not going to help much, if at all. That said, for text files, often your bottlenecks will be in parsing rather than disk I/O. There you have a good chance of making things faster if you, say, read a big buffer into a `vector` and start using threads to look for line endings, figure out integer ranges for each line, and parse each line. I used that strategy to parse Wavefront OBJ files in parallel and was able to halve the time to load a mesh [...] –  Feb 14 '18 at 05:15
  • [...] through multithreaded parsing (but again, I didn't do it to accelerate disk I/O -- my hotspots were in funcs like `sscanf`). I used a kind of "look back" strategy where each thread is assigned a start position into a big character buffer but works backwards to figure out where the first EOL is for that given thread. Then it works forward and begins parsing. However, the last line, if it spills into the next thread's buffer range, is considered to belong to the next thread. "This" thread only considers it its own line if the EOL is right at the end of its assigned buffer range. –  Feb 14 '18 at 05:18
  • Echoing `Veedrac`, avoid `std::string` and I/O streams for this if you want a fast solution. Work with char buffers (ex: stored in giant vectors) and C functions like `sscanf`, `memchr`, `strchr`, etc.. As for the disk I/O, I do things like read 64 megabytes into a char buffer, have the threads parse it.... then the last line in that buffer might be incomplete... –  Feb 14 '18 at 05:22
  • ... so then I keep track of the last line position and the trailing characters, then add them to the next char buffer, read in 64 more megabytes to that (the OBJ files I work with are sometimes over 10 gigabytes), then throw away the old buffer and fire up the threads again using parallel loops to parse lines in that buffer. –  Feb 14 '18 at 05:26

1 Answer


Is it bad to read a file in parallel? What's the algorithm to read all lines of a file into an unordered container (e.g. std::list, normal array, ...) as fast as possible, without using any libraries, and with cross-platform code?

I guess I'll attempt to answer this one to avoid spamming the comments. I have, in multiple scenarios, sped up text file parsing substantially using multithreading. However, the keyword here is parsing, not disk I/O (though just about any text file read involves some level of parsing). Now first things first:

[VTune screenshot: the top hotspot is inside obj_load, with most of the time spent in sscanf (string parsing).]

VTune here was telling me that my top hotspots were in parsing (sorry, this image was taken years ago and I didn't expand the call graph to show what inside obj_load was taking most of the time, but it was sscanf). This profiling session actually surprised me quite a bit, in spite of the fact that I've been profiling for decades, to the point where my hunches aren't too inaccurate (not accurate enough to skip profiling, mind you, not even close, but I've tuned my intuitive spider senses enough that profiling sessions usually don't surprise me much in the absence of glaring algorithmic inefficiencies -- though I might still be off about exactly why the hotspots exist, since I'm not so good at assembly).

Yet this time I was really taken aback and shocked, so this example has always been one I use to show even the most skeptical colleagues, the ones who don't want to use profilers, why profiling is so important. Some of them are actually good at guessing where hotspots exist, and some were creating very competent-performing solutions in spite of never having used a profiler, but none of them were good at guessing what isn't a hotspot, and none of them could draw a call graph based on their hunches. So I always liked to use this example to try to convert the skeptics and get them to spend a day just trying out VTune (we had a boatload of free licenses from Intel, who worked with us, which were largely going to waste on our team -- a tragedy, I thought, since VTune is a really expensive piece of software).

And the reason I was taken aback this time was not that I was surprised by the sscanf hotspot. It's kind of a no-brainer that non-trivial parsing of epic text files is generally going to be bottlenecked by string parsing. I could have guessed that. My colleagues who never touched a profiler could have guessed that. What I couldn't have guessed was how much of a bottleneck it was. I thought that, given that I was loading millions of polygons and vertices, texture coordinates, normals, creating edges and finding adjacency data, using index FOR compression, associating materials from the MTL file to the polygons, reverse engineering object normals stored in the OBJ file and consolidating them to form edge creasing, etc., I would at least have a good chunk of the time distributed in the mesh system as well (I would have guessed 25-33% of the time spent in the mesh engine).

It turned out, to my most pleasant surprise, that the mesh system took barely any time; there my hunches were completely off about it specifically. It was, by far, the parsing that was the uber bottleneck (not disk I/O, not the mesh engine).

So that's when I applied this optimization to multithread the parsing, and it helped a lot. I even started off with a very modest multithreaded implementation which barely did any parsing beyond scanning the character buffers for line endings in each thread, only to end up parsing in the loading thread, and that already helped by a decent amount (it reduced the operation from 16 seconds to about 14, IIRC, and I eventually got it down to ~8 seconds -- and that was on an i3 with just two cores and hyperthreading). So anyway, yeah, you can probably make things faster with multithreaded parsing of character buffers you read in from text files in a single thread. I just wouldn't use threads as a way to make disk I/O itself any faster.

I read the characters from the file in binary into big char buffers in a single thread, and then, using a parallel loop, have the threads figure out integer ranges for the lines in that buffer.

// Stores all the characters read in from the file in big chunks.
// This is shared for read-only access across threads.
vector<char> buffer;

// Local to a thread:
// Stores the starting position of each line.
vector<size_t> line_start;
// Stores the assigned buffer range for the thread:
size_t buffer_start, buffer_end;
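For illustration, here is roughly how the assigned ranges might be produced (my own sketch, assuming a simple even split of the buffer across threads; the struct and function names are placeholders, not from the actual loader):

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: split [0, buffer_size) into roughly equal
// per-thread ranges. An even split is just one reasonable policy.
// Assumes num_threads > 0.
struct BufferRange { std::size_t buffer_start, buffer_end; };

std::vector<BufferRange> split_buffer(std::size_t buffer_size, std::size_t num_threads)
{
    std::vector<BufferRange> ranges;
    const std::size_t per_thread = (buffer_size + num_threads - 1) / num_threads;
    for (std::size_t t = 0; t < num_threads; ++t)
    {
        const std::size_t start = t * per_thread;
        if (start >= buffer_size)
            break;
        ranges.push_back({start, std::min(start + per_thread, buffer_size)});
    }
    return ranges;
}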

Basically like so:

[Diagram: a character buffer split into two ranges, THREAD 1 and THREAD 2, with LINE1 through LINE6 laid out across them; each line belongs to the thread whose range contains its EOL, and LINE6 has no EOL yet.]

LINE1 and LINE2 are considered to belong to THREAD 1, while LINE3 is considered to belong to THREAD 2. LINE6 is not considered to belong to any thread since it doesn't have an EOL. Instead the characters of LINE6 will be combined with the next chunky buffer read from the file.

Each thread begins by looking at the first character in its assigned character buffer range. Then it works backwards until it finds an EOL or reaches the beginning of the buffer. After that it works forward and parses each line, looking for EOLs and doing whatever else we want, until it reaches the end of its assigned character buffer range. The last "incomplete line" is not processed by the thread, but instead the next thread (or if the thread is the last thread, then it is processed on the next big chunky buffer read by the first thread). The diagram is teeny (couldn't fit much) but I read in the character buffers from the file in the loading thread in big chunks (megabytes) before the threads parse them in parallel loops, and each thread might then parse thousands of lines from its designated buffer range.
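A minimal sketch of that per-thread scan, using the names from the snippet above (this is my own illustration of the strategy, assuming '\n' as the EOL; it only records line starts and leaves the actual parsing out):

#include <cstddef>
#include <vector>

// Scan buffer[buffer_start, buffer_end): back up to the start of the line
// containing buffer_start, then record the start of every line whose EOL
// falls inside this thread's range. The trailing incomplete line is left
// for the next thread (or for the next big chunk read from the file).
std::vector<std::size_t> find_line_starts(const std::vector<char>& buffer,
                                          std::size_t buffer_start,
                                          std::size_t buffer_end)
{
    std::vector<std::size_t> line_start;

    // Work backwards until we hit an EOL or the beginning of the buffer.
    std::size_t pos = buffer_start;
    while (pos > 0 && buffer[pos - 1] != '\n')
        --pos;

    // Work forwards, claiming only the lines whose EOL is inside our range.
    std::size_t line_begin = pos;
    for (std::size_t i = pos; i < buffer_end; ++i)
    {
        if (buffer[i] == '\n')
        {
            line_start.push_back(line_begin);
            line_begin = i + 1;
        }
    }
    return line_start;
}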

std::list<std::string> list;
std::ifstream file(path);

while(file.good())
{
    std::string tmp;
    std::getline(file, tmp);
    list.emplace_back(tmp);
}
process_data(list);

Kind of echoing Veedrac's comments: storing your lines in a std::list<std::string> is not a good idea if you really want to load an epic number of lines quickly. That is actually a higher priority to address than multithreading. I'd turn that into just a std::vector<char> all_lines storing all the strings back to back, plus a std::vector<size_t> line_start storing the starting position of the nth line, which you can retrieve like so:

// note that 'line' will be EOL-terminated rather than null-terminated 
// if it points to the original buffer.
const char* line = all_lines.data() + line_start[n];

The immediate problem with std::list without a custom allocator is a heap allocation per node. On top of that, we're wasting memory storing two extra pointers per line. std::string is also problematic here: its small-buffer optimization, which avoids heap allocation for short strings, makes each string object take too much memory for small strings (and thereby increases cache misses), while every non-small string still ends up invoking a heap allocation. You avoid all of these problems by just storing everything in one giant char buffer, like a std::vector<char>.

I/O streams, including stringstreams and functions like getline, are also horrible for performance -- just awful -- in a way that really disappointed me at first, since my first OBJ loader used them and it was over 20 times slower than the second version, where I ported all those I/O stream operators and functions and the use of std::string over to C functions and my own hand-rolled code operating on char buffers.

When it comes to parsing in performance-critical contexts, C functions like sscanf and memchr and plain old character buffers tend to be much faster than the C++ ways of doing it. You can at least still use std::vector<char> to store the huge buffers, e.g., to avoid dealing with malloc/free and to get some debug-build sanity checks when accessing the characters stored inside.
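As a rough illustration of that layout (my own sketch, not the loader's actual code: it assumes '\n' line endings and, for brevity, reads the whole file in one shot rather than in the big chunks described above):

#include <cstddef>
#include <cstring>   // std::memchr
#include <fstream>
#include <string>
#include <vector>

struct LineIndex
{
    std::vector<char> all_lines;          // every character of the file, back to back
    std::vector<std::size_t> line_start;  // offset of the nth line inside all_lines
};

LineIndex load_lines(const std::string& path)
{
    LineIndex index;

    // Read the whole file in binary in one shot (a simplification of the
    // chunked reads described above).
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    const std::streamsize size = file.tellg();
    if (size <= 0)
        return index;  // missing or empty file
    file.seekg(0);
    index.all_lines.resize(static_cast<std::size_t>(size));
    file.read(index.all_lines.data(), size);

    // Record the starting offset of every line using memchr.
    const char* data = index.all_lines.data();
    std::size_t pos = 0;
    const std::size_t total = index.all_lines.size();
    while (pos < total)
    {
        index.line_start.push_back(pos);
        const void* eol = std::memchr(data + pos, '\n', total - pos);
        if (!eol)
            break;  // last line has no trailing EOL
        pos = static_cast<std::size_t>(static_cast<const char*>(eol) - data) + 1;
    }
    return index;
}

// Usage: the nth line starts here and runs until the next EOL
// (EOL-terminated rather than null-terminated, as noted above).
// const char* line = index.all_lines.data() + index.line_start[n];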