
Most C++ users who learned C prefer to use the printf / scanf family of functions even when they're coding in C++.

Although I admit that I find the interface way better (especially the POSIX-like format strings and localization), it seems that an overwhelming concern is performance.

Taking a look at this question:

How can I speed up line by line reading of a file

It seems that the best answer is to use fscanf and that the C++ ifstream is consistently 2-3 times slower.

I thought it would be great if we could compile a repository of "tips" to improve IOStreams performance, what works, what does not.

Points to consider

  • buffering (rdbuf()->pubsetbuf(buffer, size))
  • synchronization (std::ios_base::sync_with_stdio)
  • locale handling (Could we use a trimmed-down locale, or remove it altogether?)

Of course, other approaches are welcome.

Note: a "new" implementation, by Dietmar Kuhl, was mentioned, but I was unable to locate many details about it. Previous references seem to be dead links.

Matthieu M.
  • I'm making this an FAQ question. Feel free to revert if you think this is wrong. – sbi Mar 02 '11 at 10:43
  • @Matthieu: Dietmar once said that his work got abandoned, though I can't find where. (In general, you need to search the newsgroups to find this stuff. `comp.lang.c++.moderated` was where all the interesting C++ discussions took place in the 90s.) – sbi Mar 02 '11 at 10:45
  • Is this factor also true for g++? I seem to remember that there has been work in the gnu stdlib implementation to remove unneeded performance hits. (I rarely do performance-sensitive formatted IO, so I don't know). – AProgrammer Mar 02 '11 at 10:49
  • @sbi, I'm pretty sure he stopped working on it. The issue recently resurfaced on clc++m and he did participate. – AProgrammer Mar 02 '11 at 10:50
  • @AProgrammer The performance difference is essentially an urban legend, fed by two facts: (1) Legacy implementation of the c++stdlib *were* slower. (2) Many people don’t know about `std::ios_base::sync_with_stdio`. – Konrad Rudolph Mar 02 '11 at 11:42
  • @AProgrammer: I only measured a *17%* performance hit using gcc 3.4.2 on unix, after increasing the buffer size. – Matthieu M. Mar 02 '11 at 12:46
  • @Matthieu, thanks for data point. – AProgrammer Mar 02 '11 at 13:29
  • @AProgrammer: I've provided the code I used for benchmarking (in full); I am interested in results on other platforms if you have the occasion. From my measurements it seems the default behavior on gcc/unix is already good to go, and no extra tuning is necessary. – Matthieu M. Mar 02 '11 at 13:54
  • @Konrad: If I debug into Dinkumware's streams implementation (one of the most widely distributed ones) of the input operators, I will ultimately arrive at `scanf()`. Of course, since this is sharing all the disadvantages of `scanf()`, and adding a few layers on top, this stream implementation will, ultimately, be slower. And I'm _not_ talking disk IO here, but pure parsing. In theory, streams might even be faster than `printf()`/`scanf()`, but I've yet to encounter such an implementation in the wild. – sbi Mar 02 '11 at 14:02
  • @AProgrammer: My comment was misleading. Yes, he stopped work on that many years ago. What I couldn't find was a posting of him where he explained why his work never got adopted. – sbi Mar 02 '11 at 14:05
  • @sbi: the same problem occurs regularly in C++, I've found. Normally template programming could move checks from runtime to compile-time, but most of the time the C++ lib is a thin wrapper around the C one, which performs all checking at runtime anyway... – Matthieu M. Mar 02 '11 at 14:06
  • Matthieu, I used your code, reduced the iterations to 1, used a large data file, and using "time" I see a 2x-3x difference between your cpp test and c test. – Bogatyr Mar 02 '11 at 14:07
  • @sbi: do you still have his work around? I could not even find archives of it, and his website seems to have been moved / shut down. – Matthieu M. Mar 02 '11 at 14:07
  • @sbi, Here is the message I was thinking of: http://groups.google.com/group/comp.lang.c++.moderated/msg/c213e6e7d75148f8 – AProgrammer Mar 02 '11 at 14:14
  • @Matthieu, the link in the message I referenced above is alive here. – AProgrammer Mar 02 '11 at 14:16
  • @Matthieu: It wasn't a workaround, but a full-blown streams implementation, which he claimed (I never tried it) to be faster than C IO. Google found it at http://www.dietmar-kuehl.de/cxxrt/. However, most of the source files are timestamped 2002, some 2003, so it really is outdated. – sbi Mar 02 '11 at 14:16
  • @AProgrammer: That's not the message I was looking for, but it's pretty much the content I wanted. Thanks for posting it! – sbi Mar 02 '11 at 14:19
  • @sbi: I didn't say workaround but "work" "around", which can be translated as "production" "somewhere"; thanks for the link, I'll put it in my "things" to read :) – Matthieu M. Mar 02 '11 at 17:59
  • @Matthieu: Ah, sorry for misunderstanding this. – sbi Mar 02 '11 at 18:15

3 Answers


Here is what I have gathered so far:

Buffering:

If the default buffer is very small, increasing the buffer size can definitely improve performance:

  • it reduces the number of HDD hits
  • it reduces the number of system calls

The buffer can be set by accessing the underlying streambuf implementation.

char Buffer[N];

std::ifstream file("file.txt");

file.rdbuf()->pubsetbuf(Buffer, N);
// the pointer returned by rdbuf() is guaranteed
// to be non-null after successful construction

Warning courtesy of @iavr: according to cppreference it is best to call pubsetbuf before opening the file. Various standard library implementations otherwise have different behaviors.
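
For reference, here is a minimal sketch of the "set the buffer before opening" variant that the warning suggests (the file name and buffer size are placeholders, not from the benchmark below); on implementations that ignore pubsetbuf after open(), this ordering is the one that actually takes effect:

#include <fstream>

int main()
{
  static char buffer[1 << 16];                     // hypothetical 64 KB buffer

  std::ifstream file;                              // default-constructed: not open yet
  file.rdbuf()->pubsetbuf(buffer, sizeof buffer);  // install the buffer first...
  file.open("file.txt");                           // ...then open, so the buffer is honored

  // ... read from file as usual ...
}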

Locale Handling:

Locales can perform character conversion, filtering, and cleverer tricks where numbers or dates are involved. They go through a complex system of dynamic dispatch and virtual calls, so removing them can help trim down the penalty.

The default "C" locale is meant to perform no conversion and to behave uniformly across machines. It's a good default to use.
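
A minimal sketch of pinning a stream to the classic "C" locale via imbue (the file name is a placeholder); whether this buys anything depends on the implementation, since streams start out with the global locale anyway:

#include <fstream>
#include <locale>

int main()
{
  std::ifstream file("file.txt");
  // the classic "C" locale: no conversions, identical on every machine
  file.imbue(std::locale::classic());

  // alternatively, set it process-wide so every new stream picks it up:
  // std::locale::global(std::locale::classic());
}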

Synchronization:

I could not see any performance improvement using this facility.

The setting is global (a static member of std::ios_base) and is toggled through the sync_with_stdio static function.
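
For completeness, a minimal sketch of toggling that setting (the same call appears in the benchmark below); it affects the standard streams globally and returns the previous value:

#include <ios>       // std::ios_base
#include <iostream>

int main()
{
  const bool wasSynced = std::ios_base::sync_with_stdio(false);  // returns previous setting

  // ... heavy iostream-only I/O; do not mix with printf/scanf here ...
  std::cout << "sync disabled\n";

  std::ios_base::sync_with_stdio(wasSynced);  // restore if desired
}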

Measurements:

To play with this, I put together a simple program, compiled using gcc 3.4.2 on SUSE 10p3 with -O2.

C : 7.76532e+06
C++: 1.0874e+07

That represents a slowdown of about 20%... for the default code. Indeed, tampering with the buffer (in either C or C++) or with the synchronization parameters (C++) did not yield any improvement.

Results by others:

@Irfy on g++ 4.7.2-2ubuntu1, -O3, virtualized Ubuntu 11.10, 3.5.0-25-generic, x86_64, enough ram/cpu, 196MB of several "find / >> largefile.txt" runs

C : 634572
C++: 473222

C++ 25% faster

@Matteo Italia on g++ 4.4.5, -O3, Ubuntu Linux 10.10 x86_64 with a random 180 MB file

C : 910390
C++: 776016

C++ 17% faster

@Bogatyr on g++ i686-apple-darwin10-g++-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5664), mac mini, 4GB ram, idle except for this test with a 168MB datafile

C : 4.34151e+06
C++: 9.14476e+06

C++ 111% slower

@Asu on clang++ 3.8.0-2ubuntu4, Kubuntu 16.04 Linux 4.8-rc3, 8GB ram, i5 Haswell, Crucial SSD, 88MB datafile (tar.xz archive)

C : 270895
C++: 162799

C++ 66% faster

So the answer is: it's a quality of implementation issue, and really depends on the platform :/

Here is the code in full, for those interested in benchmarking:

#include <fstream>
#include <iostream>
#include <iomanip>
#include <string>    // std::string used by CppRead

#include <cmath>
#include <cstdio>
#include <cstdlib>   // atoi
#include <cstring>   // strcmp
#include <clocale>   // setlocale

#include <sys/time.h>

template <typename Func>
double benchmark(Func f, size_t iterations)
{
  f();

  timeval a, b;
  gettimeofday(&a, 0);
  for (; iterations --> 0;)
  {
    f();
  }
  gettimeofday(&b, 0);
  return (b.tv_sec * (unsigned int)1e6 + b.tv_usec) -
         (a.tv_sec * (unsigned int)1e6 + a.tv_usec);
}


struct CRead
{
  CRead(char const* filename): _filename(filename) {}

  void operator()() {
    FILE* file = fopen(_filename, "r");

    int count = 0;
    while ( fscanf(file,"%s", _buffer) == 1 ) { ++count; }

    fclose(file);
  }

  char const* _filename;
  char _buffer[1024];
};

struct CppRead
{
  CppRead(char const* filename): _filename(filename), _buffer() {}

  enum { BufferSize = 16184 };

  void operator()() {
    std::ifstream file(_filename, std::ifstream::in);

    // comment to remove extended buffer
    file.rdbuf()->pubsetbuf(_buffer, BufferSize);

    int count = 0;
    std::string s;
    while ( file >> s ) { ++count; }
  }

  char const* _filename;
  char _buffer[BufferSize];
};


int main(int argc, char* argv[])
{
  size_t iterations = 1;
  if (argc > 1) { iterations = atoi(argv[1]); }

  char const* oldLocale = setlocale(LC_ALL,"C");
  if (strcmp(oldLocale, "C") != 0) {
    std::cout << "Replaced old locale '" << oldLocale << "' by 'C'\n";
  }

  char const* filename = "largefile.txt";

  CRead cread(filename);
  CppRead cppread(filename);

  // comment to use the default setting
  bool oldSyncSetting = std::ios_base::sync_with_stdio(false);

  double ctime = benchmark(cread, iterations);
  double cpptime = benchmark(cppread, iterations);

  // comment if oldSyncSetting's declaration is commented
  std::ios_base::sync_with_stdio(oldSyncSetting);

  std::cout << "C  : " << ctime << "\n"
               "C++: " << cpptime << "\n";

  return 0;
}
Matthieu M.
  • Actually I found out that C++ is faster (g++ 4.4.5, -O3, Ubuntu Linux 10.10 x86_64): with a random 180 MB file I got `C: 910390 C++: 776016`. – Matteo Italia Mar 02 '11 at 14:01
  • @Matteo: Ah that's great. I need to try with g++4.3.2 as well. – Matthieu M. Mar 02 '11 at 14:08
  • The question that led to this one has nothing to do with preference; it has to do with concrete measurements of "typical"-case input processing. Your benchmark is not really interesting, since it doesn't reflect a real-world case. Instead, why don't you write a shell script that runs your program through 1 iteration on a set of large files, and measure the aggregate wallclock time. – Bogatyr Mar 02 '11 at 14:11
  • and 2nd, you need to break up the runs into: 1 run C case, 1 run C++ case, not putting them both together in the same executable. – Bogatyr Mar 02 '11 at 14:14
  • OK I ran your code as is, with the results (3 iterations): C : 4.34151e+06 C++: 9.14476e+06, g++ i686-apple-darwin10-g++-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5664), mac mini, 4GB ram, idle except for this test. My data file is 168MB – Bogatyr Mar 02 '11 at 14:17
  • @Bogatyr `gettimeofday` is, if anything, *more* precise than `time`. Furthermore, this *is* a good approximation of a real-world case: reading data. After all, we don’t want to measure other things, only the reading of data. So this benchmark is good. And putting both codes in the same executable is perfectly fine, too. Just make sure that enough iterations of the benchmark are run to offset warming-up slowdowns (or run it once at the beginning, which Mathieu does). This benchmark is *much* superior to your suggested “improvements”. – Konrad Rudolph Mar 02 '11 at 14:21
  • @Konrad At one iteration, it's OK, if the file is of a certain size. And my interest in this subject comes from a case where my "improvements" *were* the scenario -- algorithm competitions, where you have a very limited time to read in different, large-ish data sets, not the same data set over and over again. The fact is, on that site at least on that day, "cin >> s" lost severely to "scanf". On my mac mini with the stated g++, scanf wins big, too. However, on my ubuntu linux vmware on a windows 7 laptop with 4.4.1 "cin >> s" beats "scanf". So go figure, I'll agree it "does depend." – Bogatyr Mar 02 '11 at 15:12
  • @Bogatyr: I suspect that the difference is due to the improvements in g++; I don't see changes in the iostream implementation between g++ 4.2 and 4.4, but I notice that they improved many things in the optimizer, especially regarding inlining; with all the layers involved in iostream I think that changes to the inlining algorithms can really make a significant difference. – Matteo Italia Mar 03 '11 at 14:25
  • I just tested on 3 linux machines, compiled with g++ from 4.5.4 to 4.7.2, with differences ranging from 25% faster C++ to 40% faster C++. – Irfy Mar 19 '13 at 21:33
  • The program always runs `cread` before `cppread`, and they read the same file. Then the second one will benefit from the disk cache populated by the first one. – musiphil Oct 16 '13 at 06:29
  • @musiphil: note how `benchmark` is implemented, there is a first (not timed) dry run to warm up the cache, and *only then*, are there N runs (timed). – Matthieu M. Oct 16 '13 at 07:07
  • @musiphil: no complaint from me, it's *so* easy to have a meaningless benchmark program (because of optimization, cache warmup, ...) that I am grateful for additional pair of eyes scrutinizing this code. – Matthieu M. Oct 16 '13 at 14:59
  • @Matthieu Nice work. I was just experimenting with reading a large binary file, and looking for ways to control the buffer. I realized using `strace` that `file.rdbuf()->pubsetbuf()` was ignored in my case. Then I saw [here](http://en.cppreference.com/w/cpp/io/basic_filebuf/setbuf) that it should be called *before* opening the file, which you don't do in your benchmark. – iavr May 22 '15 at 23:21
  • @iavr: Interesting, it looks like a limitation of libstdc++. I am mildly annoyed by this, as RAII is all about opening first... Guess once wrapped properly it'll work better. – Matthieu M. May 23 '15 at 12:27
  • but cppreference says `file.rdbuf()->pubsetbuf(Buffer, N);` in base class does nothing - http://en.cppreference.com/w/cpp/io/basic_streambuf/pubsetbuf – hg_git Aug 30 '16 at 12:14
  • @hg_git: Specifically, cppreference mentions that the implementation of `std::basic_streambuf::pubsetbuf` does nothing, however `pubsetbuf` is a *virtual* method and is there specifically so that derived classes *can* (if they so wish) make it do something useful. It turns out that `ifstream` will yield a derived version of `basic_streambuf` which overrides `pubsetbuf`. – Matthieu M. Aug 30 '16 at 13:58
  • @MatthieuM. Thank you :) Where can I read about `basic_streambuf` overriding `pubsetbuf`? – hg_git Aug 31 '16 at 16:53
  • ~100 MB file, `clang version 3.8.0-2ubuntu4` compiling with `-Os`: C : 278425, C++: 159543 - a 75% improvement! Getting slightly worse results on `gcc`, which speeds up C a bit and slows down C++ a bit, but by a small margin. – asu Oct 04 '16 at 09:55
  • @Asu: gcc and clang use different C++ standard libraries by default (libstdc++ and libc++ respectively) so this might be the cause of the difference you are observing. Thanks for this datapoint :) – Matthieu M. Oct 04 '16 at 10:19
  • @MatthieuM. - good point - I tried compiling with clang + libstdc++ and got C : 273557 - C++: 159604 Which is actually surprisingly even better C++ side. g++ : C : 267510 - C++: 172379 Nice to see how clang evolves. – asu Oct 04 '16 at 17:01
  • I actually removed the stdio sync and the buffering and didn't encounter significant performance impact. – asu Oct 04 '16 at 17:14

Two more improvements:

Issue std::cin.tie(nullptr); before heavy input/output.

Quoting http://en.cppreference.com/w/cpp/io/cin:

Once std::cin is constructed, std::cin.tie() returns &std::cout, and likewise, std::wcin.tie() returns &std::wcout. This means that any formatted input operation on std::cin forces a call to std::cout.flush() if any characters are pending for output.

You can avoid flushing the buffer by untying std::cin from std::cout. This is relevant when there are multiple mixed calls to std::cin and std::cout. Note that calling std::cin.tie(nullptr); makes the program unsuitable for interactive use, since output may be delayed.

Relevant benchmark:

File test1.cpp:

#include <iostream>
using namespace std;

int main()
{
  ios_base::sync_with_stdio(false);

  int i;
  while(cin >> i)
    cout << i << '\n';
}

File test2.cpp:

#include <iostream>
using namespace std;

int main()
{
  ios_base::sync_with_stdio(false);
  cin.tie(nullptr);

  int i;
  while(cin >> i)
    cout << i << '\n';

  cout.flush();
}

Both compiled by g++ -O2 -std=c++11. Compiler version: g++ (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4 (yeah, I know, pretty old).

Benchmark results:

work@mg-K54C ~ $ time ./test1 < test.in > test1.in

real    0m3.140s
user    0m0.581s
sys 0m2.560s
work@mg-K54C ~ $ time ./test2 < test.in > test2.in

real    0m0.234s
user    0m0.234s
sys 0m0.000s

(test.in consists of 1179648 lines, each containing only a single 5. It’s 2.4 MB, so sorry for not posting it here.)

I remember solving an algorithmic task where the online judge kept rejecting my program without cin.tie(nullptr) but accepted it with cin.tie(nullptr), or with printf/scanf instead of cin/cout.

Use '\n' instead of std::endl.

Quoting http://en.cppreference.com/w/cpp/io/manip/endl :

Inserts a newline character into the output sequence os and flushes it as if by calling os.put(os.widen('\n')) followed by os.flush().

You can avoid flushing the buffer by printing '\n' instead of endl.

Relevant benchmark:

File test1.cpp:

#include <iostream>
using namespace std;

int main()
{
  ios_base::sync_with_stdio(false);

  for(int i = 0; i < 1179648; ++i)
    cout << i << endl;
}

File test2.cpp:

#include <iostream>
using namespace std;

int main()
{
  ios_base::sync_with_stdio(false);

  for(int i = 0; i < 1179648; ++i)
    cout << i << '\n';
}

Both compiled as above.

Benchmark results:

work@mg-K54C ~ $ time ./test1 > test1.in

real    0m2.946s
user    0m0.404s
sys 0m2.543s
work@mg-K54C ~ $ time ./test2 > test2.in

real    0m0.156s
user    0m0.135s
sys 0m0.020s
  • Ah yes, the `endl` situation is usually well known by aficionados, but so many tutorials use it by default (why????) that it trips up beginner/intermediate programmers regularly. As for `tie`: I am learning something today! I knew prompting the user would force a flush, but didn't know how it was controlled. – Matthieu M. Feb 11 '16 at 13:31

It's interesting that you say C programmers prefer printf when writing C++, as I see a lot of code that is C apart from using cout and iostream to write the output.

Users can often get better performance by using filebuf directly (Scott Meyers mentions this in Effective STL), but there is relatively little documentation on using filebuf directly, and most developers prefer std::getline, which is simpler most of the time.
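
For illustration, here is a minimal sketch of reading through a std::filebuf directly, skipping the istream formatting layer entirely (the file name and block size are placeholders, not anything taken from Effective STL):

#include <fstream>
#include <ios>

int main()
{
  std::filebuf fb;
  if (fb.open("file.txt", std::ios_base::in))
  {
    char block[4096];
    std::streamsize n;
    // sgetn pulls raw characters straight out of the stream buffer,
    // with no sentry construction and no operator>> parsing
    while ((n = fb.sgetn(block, sizeof block)) > 0)
    {
      // ... process n bytes in block ...
    }
    fb.close();
  }
}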

With regard to locales, if you create facets you will often get better performance by creating a locale once with all your facets, keeping it stored, and imbuing it into each stream you use.
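
A hedged sketch of that "build the locale once, imbue it everywhere" idea, using a hypothetical numpunct facet as the custom facet (nothing here is taken from the answer itself):

#include <iostream>
#include <locale>
#include <string>

// hypothetical facet: group digits in threes, separated by ','
struct GroupedDigits : std::numpunct<char>
{
  char do_thousands_sep() const override { return ','; }
  std::string do_grouping() const override { return "\3"; }
};

int main()
{
  // build the locale once; it takes ownership of the facet (ref count 0)
  const std::locale withGrouping(std::locale::classic(), new GroupedDigits);

  // ...then imbue it into every stream that needs it, instead of
  // constructing a fresh locale per stream
  std::cout.imbue(withGrouping);
  std::cerr.imbue(withGrouping);

  std::cout << 1234567 << '\n';   // prints 1,234,567
}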

I did see another topic on this here recently, so this is close to being a duplicate.

CashCow
  • If you get better performance by using a file buffer directly, then that means it's the parsing code (for reading, anyway) that's the performance hog, since this is what `std::istream` wraps the buffer with. Unfortunately, widespread IO stream implementations use `printf()`/`scanf()` under the hood, which certainly must be slower than using C std lib IO directly. (Also see my comment to @Konrad on the question.) – sbi Mar 02 '11 at 14:08
  • "code that is C other than using cout and iostream" - we call it "C with iostreams" and it is what passes for C++ in many university courses. – MaHuJa Oct 22 '11 at 23:52