12

Here's a bit of code that is a considerable bottleneck after doing some measuring:

#include <fstream>
#include <string>
#include <unordered_set>

using namespace std;

//-----------------------------------------------------------------------------
//  Construct dictionary hash set from dictionary file
//-----------------------------------------------------------------------------
void constructDictionary(unordered_set<string> &dict)
{
    ifstream wordListFile;
    wordListFile.open("dictionary.txt");

    std::string word;
    while( wordListFile >> word )
    {
        if( !word.empty() )
        {
            dict.insert(word);
        }
    }

    wordListFile.close();
}

I'm reading in ~200,000 words and this takes about 240 ms on my machine. Is the use of ifstream here efficient? Can I do better? I'm reading about mmap() implementations but I'm not understanding them 100%. The input file is simply text strings with *nix line terminations.

EDIT: Follow-up question for the alternatives being suggested: Would any alternative (minus increasing the stream buffer sizes) imply that I write a parser that examines each character for new-lines? I kind of like the simple syntax of streams, but I can re-write something more nitty-gritty if I have to for speed. Reading the entire file into memory is a viable option; it's only about 2 MB.

EDIT #2: I've found that the slowdown for me was due to the set insert, but for those who are still interested in speeding up line-by-line file IO, please read the answers here AND check out Matthieu M.'s continuation on the topic.

Jon
  • 1,379
  • 1
  • 12
  • 32
  • Biggest gain regarding the reading is to use a large buffer so you don't go through all the O/S layers for every single line (or small buffer) and don't repeatedly pay the performance overhead penalty for that. Dunno what the API for increasing an ifstream's read buffer size is, though, that's why this is a comment, not an answer. But I am sure there is a method to resize the buffer (or assign your own, specifying its size). – TheBlastOne Mar 02 '11 at 07:26
  • The first thing you should do when performance is a problem is to ___profile___. Have a look at where your code spends most of its time. For all we know, it might be in adding the value to your dictionary. Unlikely, I know, but if you need performance, you really need to know. – sbi Mar 02 '11 at 08:35
  • @sbi: It's not that unlikely. In a high-performance statistics app, I've found the Boost version of `unordered_map` to be an order of magnitude slower than Google's `sparsehash` and not much faster than GNU `std::map`. – Fred Foo Mar 02 '11 at 10:59
  • @sbi and @larsmans: You guys are right, see accepted answer. – Jon Mar 02 '11 at 17:24

9 Answers

8

Quick profiling on my system (linux-2.6.37, gcc-4.5.2, compiled with -O3) shows that I/O is not the bottleneck. Whether using fscanf into a char array followed by dict.insert() or operator>> as in your exact code, it takes about the same time (155 - 160 ms to read a 240k word file).

Replacing gcc's std::unordered_set with std::vector<std::string> in your code drops the execution time to 45 ms (fscanf) - 55 ms (operator>>) for me. Try to profile IO and set insertion separately.
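
For example, a minimal sketch of timing the two phases separately (C++11 <chrono>; only the filename comes from the question, the rest is illustrative):

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

int main()
{
    using clock = std::chrono::steady_clock;

    std::vector<std::string> words;
    clock::time_point t0 = clock::now();
    {
        std::ifstream in("dictionary.txt");
        std::string w;
        while (in >> w)
            words.push_back(w);            // phase 1: I/O + string construction
    }
    clock::time_point t1 = clock::now();

    std::unordered_set<std::string> dict;
    for (const std::string& w : words)
        dict.insert(w);                    // phase 2: hash-set insertion (plus one string copy each)
    clock::time_point t2 = clock::now();

    typedef std::chrono::duration<double, std::milli> ms;
    std::cout << "read:   " << ms(t1 - t0).count() << " ms\n"
              << "insert: " << ms(t2 - t1).count() << " ms\n";
}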

Cubbi
  • 46,567
  • 13
  • 103
  • 169
  • I'm using unordered_set to speed up look-ups into the dictionary in another part of my code. But yes, this is the bottleneck indeed. It was quite premature for me to exclaim that the stream was the bottleneck... – Jon Mar 02 '11 at 17:20
4

You could get better performance, normally, by increasing the buffer size.

Right after building the ifstream, you can set its internal buffer using:

char LocalBuffer[4096]; // buffer

std::ifstream wordListFile("dictionary.txt");

wordListFile.rdbuf()->pubsetbuf(LocalBuffer, 4096);

Note: rdbuf's result is guaranteed not to be null if the construction of the ifstream succeeded.

Depending on the memory available, you are strongly encouraged to grow the buffer if possible, in order to limit interaction with the HDD and the number of system calls.

I've performed some simple measurements using a little benchmark of my own; you can find the code below (and I am interested in critiques):

gcc 3.4.2 on SLES 10 (sp 3)
C : 9.52725e+06
C++: 1.11238e+07
difference: 1.59655e+06

Which gives a slowdown of a whopping 17%.

This takes into account:

  • automatic memory management (no buffer overflow)
  • automatic resource management (no risk of forgetting to close the file)
  • handling of locale

So, we can argue that streams are slow... but please don't throw in a random piece of code and complain it's slow: optimization is hard work.


Corresponding code, where benchmark is a little utility of my own which measures the time of a repeated execution (here launched for 50 iterations) using gettimeofday.

#include <fstream>
#include <iostream>
#include <iomanip>

#include <cmath>
#include <cstdio>
#include <cstdlib>  // for atoi in main()

#include "benchmark.h"

struct CRead
{
  CRead(char const* filename): _filename(filename) {}

  void operator()()
  {
    FILE* file = fopen(_filename, "r");

    int count = 0;
    while ( fscanf(file,"%s", _buffer) == 1 ) { ++count; }

    fclose(file);
  }

  char const* _filename;
  char _buffer[1024];
};

struct CppRead
{
  CppRead(char const* filename): _filename(filename), _buffer() {}

  enum { BufferSize = 16184 };

  void operator()()
  {
    std::ifstream file(_filename);
    file.rdbuf()->pubsetbuf(_buffer, BufferSize);

    int count = 0;
    std::string s;
    while ( file >> s ) { ++count; }
  }

  char const* _filename;
  char _buffer[BufferSize];
};


int main(int argc, char* argv[])
{
  size_t iterations = 1;
  if (argc > 1) { iterations = atoi(argv[1]); }

  char const* filename = "largefile.txt";

  CRead cread(filename);
  CppRead cppread(filename);

  double ctime = benchmark(cread, iterations);
  double cpptime = benchmark(cppread, iterations);

  std::cout << "C  : " << ctime << "\n"
               "C++: " << cpptime << "\n";

  return 0;
}
Matthieu M.
  • 287,565
  • 48
  • 449
  • 722
  • "So, we can argue that streams are slow... but please, don't throw your random piece of code and complains it's slow, optimization is hard work." The point Matthieu, is the "hard work" you had to go through to get the C++ version to perform reasonably, and how surprising it is that the comparison of the one-liner read loops ends up performing so drastically differently. Good to know about the buffering, though. – Bogatyr Mar 02 '11 at 09:11
  • Interesting though, that on my little 1 million "howdythere" file test case, adding buffering made almost no difference, and the C version remains about 5x faster. – Bogatyr Mar 02 '11 at 09:16
  • By increasing the buffer size, you reduce the number of user/kernel interaction you need to read the file. That can be done in C as well with the `setbuffer` call, afaik. – PypeBros Mar 02 '11 at 09:21
  • @sylvainulg: well, if you are about to read a large file, it seems reasonnable to increase the buffer size. Thanks for the C tip, I didn't know about it. – Matthieu M. Mar 02 '11 at 09:24
  • @Bogatyr: I agree that it seems painful (especially given the extremely clumsy interface of streams), on the other hand, if you are memory constrained, a large buffer would be a disagreement as well. Did you add buffering to the C version as well ? I am by no mean a fan of the IO Streams library, but I have found that people coming from C were generally put off by its interface first and its performance second. – Matthieu M. Mar 02 '11 at 09:27
  • @Matthieu -- I in fact like streams, I don't use C i/o in my work typically, but I use streams heavily. There's no arguing about all the "goodies" that come with streams. The surprising thing, though, is the performance price for "simple" cases. I still can't get close to matching your 17% difference, though, I'll try your code verbatim. – Bogatyr Mar 02 '11 at 09:38
  • 1
    "your are strongly encouraged to grow the buffer in order to limit interaction with the HDD. 4096 corresponds to most HDDs sector's size." -- Technically, HDD sectors are 512 bytes, and only the OS tend to group them in 4K chunks so that it matches one virtual memory page easier. There are chance that your whole track, and not a mere sector, is optimistically cached at the OS level. Up to the size of the track (a priori unknown), all you observe is improvement by reducing syscalls, not interactions with the disk. So yes, your advise is wise on the thing to do and effect achieved, but imprecise – PypeBros Mar 02 '11 at 09:40
  • Using your code I still see c++ 2-3x slower than C when performing one iteration through a single large-ish file. – Bogatyr Mar 02 '11 at 09:54
  • @sylvainulg: thanks for the precision :) I removed the size bit and added that improvement was both from HDD and syscalls. @Bogatyr: It's bothering that there is so much difference between the two implementations... I assume you do measure code compiled with optimizations on? – Matthieu M. Mar 02 '11 at 09:58
  • hope your 'benchmark' (repeat 50 times etc) threw the first run time away or else the C version potentially has nothing in OS file cache, and the C++ does. – Jonathan Graehl Feb 13 '13 at 04:02
  • @JonathanGraehl: The number of times is given by the `iterations` parameter, and yes there is a dry-run to start with that is not taken into account. A better benchmark would imply cycling through the various versions multiple times and only taking the best or median result for each. – Matthieu M. Feb 13 '13 at 11:16
2

My system: 3.2.0-52-generic, g++-4.7 (Ubuntu/Linaro 4.7.3-2ubuntu1~12.04) 4.7.3, compiled with -O2 unless specified otherwise, CPU: i3-2125.

In my test cases I used a 295,068-word dictionary (so, about 100k more words than in yours): http://dl.dropboxusercontent.com/u/4076606/words.txt

From a time-complexity point of view:

  • Worst case, your program's complexity is O(n*n): n words read * O(n) per unordered_set insert
  • Average case, it is O(n): n words read * O(1) per unordered_set insert

Practical tips:

  • The simplest data structures have the least time overhead: a plain array is faster than a vector, and a char array is faster than a string. There is plenty of information on the web about this.

My measurements:

Note: I didn't flush my OS cache or the HDD cache. The latter I can't control, but the former can be flushed with:

sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

I also didn't discard measurements that included a lot of context switches and so on, so there is room for better measurements.

My code (read from file & insert data; search all the words):


2D char array + binary search:

  • 14-16 ms to read from file & insert the data into a 2D char array (read & insert, n times)
  • 65-75 ms to search for all the words with binary search (search, n times)
  • Total = 79-91 ms

unordered_set of char arrays:

  • 61-78 ms to read from file & insert the data into the unordered_set (read & insert, n times)
  • 7-9 ms to search for all the words by hash (search, n times)
  • Total = 68-87 ms


If you search more times than you insert, choose a hash table (unordered_set); otherwise, choose binary search (with a simple array).


Your original code (search & insert):

Compiled with -O2: 157-182 ms

Compiled with -O0 (if you omit the -O flag, the default optimization level is also 0): 223-248 ms

So compiler options also matter; in this case they mean a ~66 ms speed boost. You didn't specify any of them, so my best guess is that you didn't use one. With that, let me try to answer your main question.


What are the simplest things you can do to improve your current code?

  1. [better usage of the high-level API] Use "getline(wordListFile, word)" instead of "wordListFile >> word". I also think "getline" is more readable than the ">>" operator.

Compiled with -O2: 142-170 ms. ~ 12-15 ms speed boost compared with your original code.

Compiled with -O0 (if you omit the -O flag, the default optimization level is also 0): 213-235 ms. ~10-13 ms speed boost compared with your original code.

  2. [better usage of the high-level API] Avoid rehashing with "dict.reserve(XXXXXX);"; @David Rodríguez - dribeas also mentioned this. It helps if your dictionary is static, or if you can guess your dictionary size (for example, from the file size divided by the average word length). First run without "reserve" and print the bucket_count (cout << "bucket_count = " << dict.bucket_count() << "\n";); in my case it is 299951. Then I added "dict.reserve(299951);" (see the sketch after this list).

Compiled with -O2: 99-121-[137] ms. ~ 33-43-[49] ms speed boost compared with your original code.
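
Putting the two tweaks together, a minimal sketch of the adjusted function (the reserve value is just the bucket_count measured above for this particular dictionary):

#include <fstream>
#include <string>
#include <unordered_set>

void constructDictionary(std::unordered_set<std::string> &dict)
{
    std::ifstream wordListFile("dictionary.txt");

    dict.reserve(299951);                      // bucket_count measured above; estimate from the file size otherwise

    std::string word;
    while (std::getline(wordListFile, word))   // one word per line, *nix line endings as in the question
    {
        if (!word.empty())
            dict.insert(word);
    }
}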

What more advanced things can you do to speed it up?

Implement your own hash function for your specific data input (a rough sketch follows below). Use a char array instead of an STL string. Only after you have done that, write code with direct OS I/O. As you noticed (and as my measurements also show), the data structure is the bottleneck. If the medium is very slow but the CPU is very fast, compress the file and decompress it in your program.
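
For illustration, a rough sketch of a hand-rolled hash and equality for C strings (FNV-1a-style constants; this is not the pastebin code linked below, and the character storage, e.g. the 2D char array above, must be owned elsewhere):

#include <cstddef>
#include <cstring>
#include <unordered_set>

struct WordHash
{
    std::size_t operator()(const char* s) const
    {
        std::size_t h = 2166136261u;             // FNV-1a offset basis (32-bit)
        for (; *s; ++s)
        {
            h ^= static_cast<unsigned char>(*s);
            h *= 16777619u;                      // FNV-1a prime (32-bit)
        }
        return h;
    }
};

struct WordEq
{
    bool operator()(const char* a, const char* b) const
    { return std::strcmp(a, b) == 0; }
};

// The set stores pointers only, so the characters they point at must stay alive.
typedef std::unordered_set<const char*, WordHash, WordEq> Dict;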


My code is not perfect, but it is still better than anything seen above: http://pastebin.com/gQdkmqt8 (the hash function is from the web and can also be done better)


Could you provide more details about which system (a single one or a range) you plan to optimize for?

Info on time complexities: these should be links, but I don't have enough reputation as I'm a beginner on Stack Overflow.

Is my answer still relevant to anything? Please add a comment or vote, as there is no PM feature as far as I can see.

KęstutisV
  • 61
  • 1
  • 7
2

Reading the whole file into memory in one go and then operating on it there would probably be faster, as it avoids repeatedly going back to the disk to read another chunk.
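
For example, a minimal sketch of that approach (assuming the ~2 MB file from the question fits comfortably in memory):

#include <fstream>
#include <sstream>
#include <string>
#include <unordered_set>

void constructDictionary(std::unordered_set<std::string> &dict)
{
    std::ifstream in("dictionary.txt", std::ios::binary);
    std::ostringstream contents;
    contents << in.rdbuf();                     // single bulk read into memory

    std::istringstream words(contents.str());   // then split the words in memory
    std::string word;
    while (words >> word)
        if (!word.empty())
            dict.insert(word);
}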

Is 0.25s actually a problem? If you're not intending to load much larger files, is there any need to make it faster if it makes it less readable?

Jon Cage
  • 36,366
  • 38
  • 137
  • 215
  • Reading the entire file ahead of time is not always a viable option. Doing line by line input is far, far faster using the plain C file I/O than the c++ streams, at least the last time I measured it. Counter to intuition, but measurement is king. – Bogatyr Mar 02 '11 at 07:31
  • Agreed, but the OP didn't mention any limitations like memory to worry about. – Jon Cage Mar 02 '11 at 07:33
  • This is a typical memory vs. speed tradeoff. So one option is to minimize the number of I/O operations and just read the thing in one go into a string, and then transform that string into an unordered set of words. – Mephane Mar 02 '11 at 07:57
2

The C++ and C libraries read stuff off the disk equally fast and are already buffered to compensate for the disk I/O lag. You are not going to make it faster by adding more buffering.

The biggest difference is that C++ streams do a load of manipulation based on the locale: character conversions, punctuation, etc.

As a result the C libraries will be faster.

Replaced Dead Link

For some reason the linked question was deleted. So I am moving the relevant information here. The linked question was about hidden features in C++.


Though not technically part of the STL, the streams library is part of the standard C++ libraries.

For streams:

Locales.

Very few people actually bother to learn how to correctly set and/or manipulate the locale of a stream.

The second coolest thing is the iterator templates.
Most specifically for me, the stream iterators, which basically turn streams into very basic containers that can then be used in conjunction with the standard algorithms.
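
For illustration, a minimal sketch (mine, not code from the deleted question) of how stream iterators collapse the question's dictionary loading into a single constructor call:

#include <fstream>
#include <iterator>
#include <string>
#include <unordered_set>

std::unordered_set<std::string> loadDictionary(const char* filename)
{
    std::ifstream in(filename);
    // istream_iterator<string> reads whitespace-separated tokens, so each line becomes one element
    return std::unordered_set<std::string>(std::istream_iterator<std::string>(in),
                                           std::istream_iterator<std::string>());
}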

Examples:

  • Did you know that locales will change the '.' in a decimal number to any other character automatically?
  • Did you know that locales will add a ',' every third digit to make numbers easier to read?
  • Did you know that locales can be used to manipulate text on the way through (i.e. conversion from UTF-16 to UTF-8 when writing to a file)?

etc.

Martin York
  • 257,169
  • 86
  • 333
  • 562
  • wrt locale --> is it also activated in binary mode ? During my benchmark I did not see any difference in performance between binary and non-binary. – Matthieu M. Mar 02 '11 at 09:22
  • @Matthieu M: Great question. I just stepped through the code in DevStudio and yes it does. The binary designation does not affect the use of locale. The only thing it does is change how the end of line sequence is treated (which is not part of the locale). – Martin York Mar 02 '11 at 10:00
  • thanks :) It's consistent with my observation then. Do you know if there is a way to "disable" this locale ? The issue with locale is all those facets manipulations etc... I can see a way to inject a new locale, but I cannot see any way to remove it altogether, and I am not sure that putting a locale without any facet would actually improve performance much. – Matthieu M. Mar 02 '11 at 10:08
  • @Matthieu M: No I don't know how to disable them. But the codecvt facet has a single method(always_noconv) that indicates whether the other methods need to be called. I would think the other facet types have techniques that limit the number of virtual calls when nothing is being done. – Martin York Mar 02 '11 at 11:29
1

A proper implementation of the IO library would cache the data for you, avoiding excessive disk accesses and system calls. I recommend that you use a system-call level tool (e.g. strace if you're under Linux) to check what actually happens with your IO.
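
For instance, with strace's -c summary and -e trace= filters (./myprog stands in for your binary):

# summarise the system calls made (and the time spent in them) while loading the dictionary
strace -c ./myprog

# or watch the individual read() calls to see how large the buffered chunks really are
strace -e trace=read ./myprog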

Obviously, dict.insert(xxx) could also be a nuisance if it doesn't allow O(1) insertion.

PypeBros
  • 2,607
  • 24
  • 37
  • 'dict' is an unordered_set (hashed set) so it should be O(1) insert time. – Jon Mar 02 '11 at 07:36
  • I'll look into your strace suggestion. Thanks. – Jon Mar 02 '11 at 07:46
  • The point is not what a "proper" implementation would do, but what *existing* implementations *actually* do. Line by line reading in c++ streams is demonstrably, measurably slower than doing the same in C i/o. – Bogatyr Mar 02 '11 at 07:54
  • @Jon: Amortized O(1) does not mean that it cannot be made faster. In particular, adding a size hint during creation of the `unordered_set` can avoid intermediate memory allocations (which are expensive) and copying (from the old allocated set to the new). While theoretically insertion is an O(1) operation with and without the hint, in real life the performance can be impacted by this. – David Rodríguez - dribeas Mar 02 '11 at 09:18
0

Believe it or not, the performance of the stdlib stream in reading data is far below that of the C library routines. If you need top IO read performance, don't use c++ streams. I discovered this the hard way on algorithm competition sites -- my code would hit the test timeout using c++ streams to read stdin, but would finish in plenty of time using plain C FILE operations.

Edit: Just try out these two programs on some sample data. I ran them on Mac OS X 10.6.6 using g++ i686-apple-darwin10-g++-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5664) on a file with 1 million lines of "howdythere", and the scanf version runs consistently 5 times faster than the cin version:

#include <stdio.h>

int main()
{
    int count = 0;
    char buf[1024];
    while ( scanf("%s", buf) == 1 )
        ++ count;

    printf( "%d lines\n", count );
}

and

#include <iostream>

int main()
{
    char buf[1024];
    int count = 0;

    while ( ! std::cin.eof() )
    {
        std::cin.getline( buf, 1023 );
        if ( ! std::cin.eof() )
            ++count;
    }
    std::cout << count << " lines" << std::endl;
}

Edit: changed the data file to "howdythere" to eliminate the difference between the two cases. The timing results did not change.

Edit: I think the amount of interest (and the downvotes) in this answer shows how contrary to popular opinion the reality is. People just can't believe that the simple case of reading input in both C and streams can be so different. Before you downvote: go measure it yourself. The point is not to set tons of state (that nobody typically sets), but just the code that people most frequently write. Opinion means nothing in performance: measure, measure, measure is all that matters.

Bogatyr
  • 19,255
  • 7
  • 59
  • 72
  • This isn't necessarily true for all situations. For example in a algorithm competition I did, I had to read in large numbers of integers and the c++ streams actually worked faster than the c i/o. It depends heavily on the compiler, system and type of i/o. – flight Mar 02 '11 at 07:30
  • Who are you asking? If you're asking me, then I already answered that. (differences in compiler, system and type of i/o) – flight Mar 02 '11 at 07:32
  • I have similar experience with streams, so i would at least consider rewriting the function. I am almost certain that reading the whole file into a buffer and not using streams will outperform the current OP solution easily. – PeterK Mar 02 '11 at 07:34
  • @quasiverse I've found these results to be very consistent across different OSs and compilers. I'd be interested in seeing your code where streams outperformed C i/o because I haven't found such a case yet. – Bogatyr Mar 02 '11 at 07:50
  • Not sure if it will make a huge difference, but the C++ version isn't really equivalent to the C version. Replacing the whole `while` with `while (std::getline(std::cin, buf)) ++count;` may be faster. Also need to replace buf with an `std::string`. – Collin Dauphinee Mar 02 '11 at 07:59
  • @Bogatyr and @dauphic: Yes, I'd like to see it actually converted to a string, because that is my end goal here, to get a std::string objects from lines. – Jon Mar 02 '11 at 08:04
  • 1
    @Bogatyr: did you try increasing the stream buffer ? Because unlike the C version, your C++ version does not guarantee, at all, that the bytes will be read (from the disk) `1023` at a time... even though you only see chunks of that size. – Matthieu M. Mar 02 '11 at 08:15
  • @Jon -- the point is the performance of sucking in the data, you can read directly in to the storage area of a pre-initialized std::string if you like. But the entire process will go faster with C I/O, just try it and measure it yourself. – Bogatyr Mar 02 '11 at 08:16
  • @Matthieu the 1023 is a limit, not a chunk size. edit: ok it is a chunk size but it's immaterial because getline stops at the default delimiter which is newline – Bogatyr Mar 02 '11 at 08:20
  • @dauphic your suggestion runs slower than my example. – Bogatyr Mar 02 '11 at 08:25
  • @sylvainulg: Mostly it's because IO stream implementations implemented the simple way on top of C IO. If you throw away all the type info you have in `is >> an_int` and call `scanf()` underneath, then all you get is a wrapper around `scanf()`, and wrapping stuff takes its toll. OTOH, for `is >> an_int` the stream "knows" there's going to be an int to input, so in theory it could be faster. Dietmar Kühl used to have a streams implementation that tried to do that. Unfortunately, it got lost in the mists of time... – sbi Mar 02 '11 at 08:33
  • The C++ iostream does a lot of local manipulation that is not done in the C I/O lib – Martin York Mar 02 '11 at 08:46
  • I just compared the performance of reading the data from a `stringstream` to reading it from an `ifstream` - they are identical. So the problem is not the actual disk-IO. – Björn Pollex Mar 02 '11 at 08:49
  • @Bogatyr, @Space_C0wb0y: I've measured a **17%** slowdown (see answer for code) so, as I said, it's more a matter of buffering. When tuned properly, it's not that slower, though I of course wish it were faster :) – Matthieu M. Mar 02 '11 at 08:56
  • Prefer %1023s to make sure you don't over run that buffer. If you are not using the string then %*s means ignore the parameter. Or to ignore a line %*[^\n] – Martin York Mar 02 '11 at 09:04
  • @martin: Although wrapping has a cost, I doubt it can account for the 500% performance penalty mentioned. I'd expect the std::getline to be implemented with assumptions such as "you'll love to get your new line utf8-parsed, on a freshly allocated memory area", that are possibly inappropriate for the "read dictionary" case. That would allow us to understand "only use C++ streams if ..." rather than "use C I/Os if you want speed" (regardless of what you plan to do with the content). – PypeBros Mar 02 '11 at 10:57
  • On my computer (linux-2.6.37, gcc 4.5.2, compiled with -O3), I am getting on average, 155 ms using C IO (your fscanf() into a char array followed by `dict.insert(std::string(buffer))`) and 160 ms using OP's unmodified C++ code. Reading a file with 234,937 words in it. – Cubbi Mar 02 '11 at 11:22
  • @Cubbi well, that's apples and oranges. Pure reads vs. reading and inserting into a container? Can't conclude much about that. I took Matthieu's code, and run it unmodified except making it one iteration on a large-ish file (16M lines text file), and I still see C I/O 2-3x faster. – Bogatyr Mar 02 '11 at 11:27
  • @sylvainulg: I would agree (at 500%), there is more to it than just the locale stuff. – Martin York Mar 02 '11 at 11:33
  • First thing, disable cstdio synchronization, then measure again. Before doing that, this benchmark is irrelevant. – Konrad Rudolph Mar 02 '11 at 11:35
  • @Bogatyr: I've posted another question centered around getting IOStreams to be faster at http://stackoverflow.com/questions/5166263/how-to-get-iostream-to-perform-better/5168384#5168384 I'd really like to know your platform and compilers because it seems no one is able to reproduce your poor performances :/ – Matthieu M. Mar 02 '11 at 13:56
0

Unfortunately, there's not much you can do to increase performance when using an fstream.

You may be able to get a very slight speed improvement by reading in larger chunks of the file and then parsing out single words, but this depends on how your fstream implementation does buffering.

The only way to get a big improvement is to use your OS's I/O functions. For example, on Windows, opening the file with the FILE_FLAG_SEQUENTIAL_SCAN flag may speed up reads, as well as using asynchronous reads to grab data from disk and parse it in parallel.
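
A rough sketch of that Win32 route (synchronous read only; readWholeFile is a made-up helper name, error handling is omitted, and the file is assumed to fit in memory):

#include <windows.h>

#include <string>

std::string readWholeFile(const char* path)
{
    // FILE_FLAG_SEQUENTIAL_SCAN hints the cache manager to read ahead aggressively
    HANDLE h = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);

    LARGE_INTEGER size;
    GetFileSizeEx(h, &size);                    // assumes success and a file small enough for one DWORD-sized read

    std::string contents(static_cast<std::string::size_type>(size.QuadPart), '\0');
    DWORD bytesRead = 0;
    ReadFile(h, &contents[0], static_cast<DWORD>(size.QuadPart), &bytesRead, NULL);
    CloseHandle(h);

    return contents;                            // tokenize the words out of this buffer afterwards
}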

Collin Dauphinee
  • 13,664
  • 1
  • 40
  • 71
0

If you really want it fast, ditch istream and string and create a trivial class Read_Only_Text around a const char* & size, then memory-map the file and insert into an unordered_set<Read_Only_Text> with references to the embedded strings. It will mean you needlessly keep the 2 MB file mapped even though your number of unique keys may be much smaller, but it'll be very, very fast to populate. I know this is a pain, but I've done it several times for various tasks and the results are very good.
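
For what it's worth, here is a minimal sketch of that idea (my own illustration, not the author's actual code): POSIX mmap, '\n'-terminated words as in the question, an explicit hash functor instead of a std::hash specialization, and no error handling or unmapping:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>
#include <cstring>
#include <unordered_set>

struct Read_Only_Text
{
    const char* p;
    std::size_t n;

    bool operator==(const Read_Only_Text& o) const
    { return n == o.n && std::memcmp(p, o.p, n) == 0; }
};

struct Read_Only_Text_Hash
{
    std::size_t operator()(const Read_Only_Text& t) const
    {
        std::size_t h = 0;
        for (std::size_t i = 0; i < t.n; ++i)
            h = h * 31 + static_cast<unsigned char>(t.p[i]);
        return h;
    }
};

typedef std::unordered_set<Read_Only_Text, Read_Only_Text_Hash> Dict;

// Maps the file and fills the set with references into the mapping.
// The mapping must stay alive as long as the set is used.
const char* constructDictionary(Dict& dict, const char* filename, std::size_t& length)
{
    int fd = open(filename, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    length = static_cast<std::size_t>(st.st_size);

    const char* data = static_cast<const char*>(
        mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0));
    close(fd);                                   // the mapping remains valid after close

    const char* begin = data;
    const char* end   = data + length;
    for (const char* q = begin; q != end; ++q)
        if (*q == '\n')                          // *nix line terminations, as in the question
        {
            if (q != begin)
            {
                Read_Only_Text word = { begin, static_cast<std::size_t>(q - begin) };
                dict.insert(word);
            }
            begin = q + 1;
        }
    return data;                                 // caller munmap()s this when finished with the set
}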

Tony Delroy
  • 102,968
  • 15
  • 177
  • 252