
I'm trying to understand how to improve the performance of this C++ code to bring it on par with the C code it is based on. The C code looks like this:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

typedef struct point {
  double x, y;
} point_t;

int read_point(FILE *fp, point_t *p) {
  char buf[1024];
  if (fgets(buf, 1024, fp)) {
    char *s = strtok(buf, " ");
    if (s) p->x = atof(s); else return 0;
    s = strtok(NULL, " ");
    if (s) p->y = atof(s); else return 0;
  }
  else
    return 0;
  return 1;
}

int main() {
  point_t p;
  FILE *fp = fopen("biginput.txt", "r");

  int i = 0;
  while (read_point(fp, &p))
    i++;

  printf("read %d points\n", i);
  return 0;
}

The C++ code looks like this:

#include <iostream>
#include <fstream>

using namespace std;

struct point {
  double x, y;
};

istream &operator>>(istream &in, point &p) {
  return in >> p.x >> p.y;
}

int main() {
  point p;
  ifstream input("biginput.txt");

  int i = 0;
  while (input >> p)
    i++;

  cout << "read " << i << " points" << endl;
  return 0;
}

I like that the C++ code is shorter and more direct, but when I run them both I get very different performance (both run on the same machine against the same 138 MB test file):

$ time ./test-c
read 10523988 points
    1.73 real         1.68 user         0.04 sys
# subsequent runs:
    1.69 real         1.64 user         0.04 sys
    1.72 real         1.67 user         0.04 sys
    1.69 real         1.65 user         0.04 sys

$ time ./test-cpp
read 10523988 points
   14.50 real        14.36 user         0.07 sys
# subsequent runs
   14.79 real        14.43 user         0.12 sys
   14.76 real        14.40 user         0.11 sys
   14.58 real        14.36 user         0.09 sys
   14.67 real        14.40 user         0.10 sys

Running either program many times in succession does not change the result that the C++ version is about 10x slower.

The file format is just lines of space-separated doubles, such as:

587.96 600.12
430.44 628.09
848.77 468.48
854.61 76.18
240.64 409.32
428.23 643.30
839.62 568.58

Is there a trick to reducing the overhead that I'm missing?

Edit 1: Making the operator inline seems to have had a very small but possibly detectable effect:

   14.62 real        14.47 user         0.07 sys
   14.54 real        14.39 user         0.07 sys
   14.58 real        14.43 user         0.07 sys
   14.63 real        14.45 user         0.08 sys
   14.54 real        14.32 user         0.09 sys

This doesn't really solve the problem.

Edit 2: I'm using clang:

$ clang --version
Apple LLVM version 7.0.0 (clang-700.0.72)
Target: x86_64-apple-darwin15.5.0
Thread model: posix

I'm not using any optimization level on either the C or the C++ build, and both are compiled with the same version of Clang on my Mac, probably the version that comes with Xcode (/usr/bin/clang) on OS X 10.11. I figured it would cloud the issue if I enabled optimizations in one but not the other, or used different compilers.

Edit 3: replacing istream &operator>> with something else

I've rewritten the istream operator to be closer to the C version, and it is improved, but I still see a ~5x performance gap.

inline istream &operator>>(istream &in, point &p) {
  string line;
  getline(in, line);

  if (line.empty())
    return in;

  size_t next = 0;
  p.x = stod(line, &next);
  p.y = stod(line.substr(next));
  return in;
}

Runs:

$ time ./test-cpp
read 10523988 points
    6.85 real         6.74 user         0.05 sys
# subsequently
    6.70 real         6.62 user         0.05 sys
    7.16 real         6.86 user         0.12 sys
    6.80 real         6.59 user         0.09 sys
    6.79 real         6.59 user         0.08 sys

Interestingly, compiling this with -O3 is a substantial improvement:

$ time ./test-cpp
read 10523988 points
    2.44 real         2.38 user         0.04 sys
    2.43 real         2.38 user         0.04 sys
    2.49 real         2.41 user         0.04 sys
    2.51 real         2.42 user         0.05 sys
    2.47 real         2.40 user         0.05 sys

Edit 4: Replacing body of istream operator>> with C stuff

This version gets quite close to the performance of C:

inline istream &operator>>(istream &in, point &p) {
  char buf[1024];
  in.getline(buf, 1024);
  char *s = strtok(buf, " ");
  if (s)
    p.x = atof(s);
  else
    return in;

  s = strtok(NULL, " ");
  if (s)
    p.y = atof(s);

  return in;
}

Timing it unoptimized gets us into 2-second territory, and optimization puts it ahead of the unoptimized C (optimized C still wins, though). To be precise, without optimizations:

    2.13 real         2.08 user         0.04 sys
    2.14 real         2.07 user         0.04 sys
    2.33 real         2.15 user         0.05 sys
    2.16 real         2.10 user         0.04 sys
    2.18 real         2.12 user         0.04 sys
    2.33 real         2.17 user         0.06 sys

With:

    1.16 real         1.10 user         0.04 sys
    1.19 real         1.13 user         0.04 sys
    1.11 real         1.06 user         0.03 sys
    1.15 real         1.09 user         0.04 sys
    1.14 real         1.09 user         0.04 sys

The C with optimizations, just to do apples-to-apples:

    0.81 real         0.77 user         0.03 sys
    0.82 real         0.78 user         0.04 sys
    0.87 real         0.80 user         0.04 sys
    0.84 real         0.77 user         0.04 sys
    0.83 real         0.78 user         0.04 sys
    0.83 real         0.77 user         0.04 sys

I suppose I could live with this, but as a novice C++ user, I'm now wondering if:

  1. Is it worth trying to do this another way? I'm not sure it matters what happens inside the istream operator>>.
  2. Is there another way to build the C++ code that might perform better besides these three ways?
  3. Is this idiomatic? If not, do most people just accept the performance for what it is?

Edit 5: This question is totally different from the one about printf; I don't see how the linked question this is supposedly a duplicate of addresses any of the three points directly above.

Daniel Lyons
  • The examples are not comparing like with like. The C++ version reads every value from the stream object, and would be comparable with something like `fscanf(fp, "%lf %lf", &p->x, &p->y)`. The C version is hand-crafted to beat the performance of such a `fscanf()` call, by reading a line directly and then parsing the line. The C++ version could be similarly crafted to use `std::getline()` or `std::istream::getline()` and then parse the string. – Peter Jun 18 '16 at 07:37
  • @Peter I don't think anybody uses fscanf in practice; it's very brittle. But I'm happy to make the suggested change to use `getline`—what is the next step after that, making a string stream and using `>>` from that instead of the iostream? – Daniel Lyons Jun 18 '16 at 07:38
  • @Peter Please note I'm not trying to create a perfect benchmark here, I'm actually interested in accelerating some C++ code in practice, as a novice user of C++. – Daniel Lyons Jun 18 '16 at 07:40
  • I was commenting on what your code is comparing - not advocating either use or avoidance of `fscanf()`. The C++ code you have is, however, functionally equivalent to `fscanf()` (apart from the need to parse a format string) albeit less brittle. One way to parse a string once read would be to use a `std::istrstream`. There are others. – Peter Jun 18 '16 at 07:42
  • @DanielLyons I think that part of the issue is that in C, you are assuming only one delimiter and fixed buffer length. However, your C++ approach doesn't take that into account so it will try to recognize *any* delimiter within *any* length. Basically, the more you know about your data and the more you tell about it, the faster it will be. – Frederik.L Jun 18 '16 at 07:43
  • Sorry, but comparing timings of unoptimized binaries is pretty meaningless. As a rule of thumb, C++ functions often have more layers of indirection, which get optimized out completely by the compiler if you activate optimization. – MikeMB Jun 18 '16 at 07:43
  • @Peter Is that the approach you would take? – Daniel Lyons Jun 18 '16 at 07:44
  • @Frederik.L I understand that I am not great at C++ but it would actually be helpful to me, perhaps even an answer to my question, to direct me to the actual solution. – Daniel Lyons Jun 18 '16 at 07:44
  • @MikeMB I don't agree with your reasoning, but adding `-O3` does not appear to have made a significant difference, so it's a moot point. – Daniel Lyons Jun 18 '16 at 07:45
  • If I was concerned with I/O performance, I'd certainly evaluate it. MikeMB's comment about optimisation is also relevant - C++ streams operations are effectively inlined (templated) code, so optimisation has more chance of achieving performance improvement than is the case in C (where the functions are precompiled in a library). – Peter Jun 18 '16 at 07:47
  • @Peter in that case, what compiler flags would you suggest? `-O3` didn't help. – Daniel Lyons Jun 18 '16 at 07:48
  • I'd change the code before bothering with optimisation settings. Changing optimisation settings doesn't tend to help much when comparing performance of functionally different code. – Peter Jun 18 '16 at 07:53
  • @Peter I've tried your approach and it is an improvement but there is still a significant gap – Daniel Lyons Jun 18 '16 at 08:02
  • @MikeMB compiling with `-O3` against @Peter's code runs about 2x slower, which helps close the gap significantly. – Daniel Lyons Jun 18 '16 at 08:04
  • Blame your STL iostream implementation. There's nothing more to it. STL implementations are notorious for being less performant than their C counterparts; I have seen similar behavior in other circumstances. – cmaster - reinstate monica Jun 18 '16 at 08:57
  • How do you check atof success? C++ is type-safe. – Benoît Jun 18 '16 at 10:25

3 Answers

4

What's causing a significant difference in performance is a significant difference in the overall functionality.

I will do my best to compare both of your seemingly equivalent approaches in detail.

In C:

Looping

  • Read characters until a newline or end-of-file is detected or max length (1024) is reached
  • Tokenize looking for the hardcoded white-space delimiter
  • Parse into double without any questions

In C++:

Looping

  • Read characters until one of the default delimiters is detected. The detection isn't limited to your actual data pattern; the stream checks for every delimiter just in case, which adds overhead everywhere.
  • Once it finds a delimiter, it tries to parse the accumulated string gracefully, without assuming any pattern in your data. For example, if there are 800 consecutive numeric characters and the string is no longer a good candidate for the type, it must be able to detect that possibility by itself, so it adds some overhead for that too.

One way to improve performance, very close to what Peter suggested in the comments above, is to use getline inside operator>> so that you can tell the stream about your data. Something like this should give some of your speed back, though it is somewhat like C-ing part of your code back:

istream &operator>>(istream &in, point &p) {
    char bufX[32], bufY[32];  // roomy enough for any formatted double
    in.getline(bufX, sizeof(bufX), ' ');
    in.getline(bufY, sizeof(bufY), '\n');
    p.x = atof(bufX);
    p.y = atof(bufY);
    return in;
}

Hope it's helpful.

Edit: applied nneonneo's comment

Frederik.L
  • Nit: you want `in.getline(bufX, sizeof(bufX), ' ');`. Also use `\n` for `bufY`. – nneonneo Jun 18 '16 at 08:52
  • @nneonneo Thanks for the input, I updated accordingly – Frederik.L Jun 18 '16 at 08:56
  • Everyone sort of piled on me as if I were trying to benchmark iostreams or as if I had said that the two programs were equivalent, but I never did say that, only that I had a C program and was trying to match it with a C++ program. I'm accepting and upvoting you because your code actually works and you aren't giving me too much irrelevant advice. It's as if you actually read my question. But this whole experience has been so overwhelmingly negative I'm not sure what to do. – Daniel Lyons Jun 19 '16 at 05:59
  • @DanielLyons Your question is interesting and I dug a little into it before answering. My overall advice is: take the approach that's easier for you to use and maintain. For that part, C++ approaches are often more flexible and may scale better, while C is often a beast that fits over your exact data structure in one exact moment. Whenever there are performance issues, go back to some evil mechanics in C and wrap them up in meaningful function names. – Frederik.L Jun 19 '16 at 06:07
  • This is great advice! I will keep it in mind. I admire the flexibility, I just have a problem where (at least on the file I/O side) it may be too expensive without "some evil mechanics." – Daniel Lyons Jun 19 '16 at 06:08
  • @DanielLyons you may also be able to recover some perf by decoupling the iostreams from stdio: http://www.cplusplus.com/reference/ios/ios_base/sync_with_stdio/ – kfsone Jul 25 '16 at 03:03
  • @kfsone this was suggested many other times in the comments above, but is not actually relevant to this situation – Daniel Lyons Jul 25 '16 at 03:19
3

Update: I did some more testing and (if you have enough memory) there is a surprisingly simple solution that, at least on my machine with VS2015, outperforms the C solution: just buffer the whole file in a stringstream.

ifstream input("biginput.txt");
std::stringstream buffer;
buffer << input.rdbuf();  // slurp the whole file into memory

point p;
int i = 0;
while (buffer >> p)
    i++;

So the problem seems not to be the C++ streaming mechanism itself so much as the internals of ifstream in particular.
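If memory allows, the slurp-then-parse idea can be pushed one step further: scan the buffered text directly with `strtod` and skip the per-extraction stream machinery entirely. A sketch, assuming well-formed whitespace-separated input, and not timed against the file above:

```cpp
#include <cstdlib>
#include <string>

// Count x/y pairs in a buffer of whitespace-separated doubles.
// strtod both parses a number and reports where it stopped, so the
// whole buffer can be walked with no per-line allocation at all.
int count_points(const std::string &text) {
  const char *s = text.c_str();
  char *end = nullptr;
  int doubles = 0;
  for (;;) {
    std::strtod(s, &end);
    if (end == s)  // no more parseable numbers
      break;
    ++doubles;
    s = end;
  }
  return doubles / 2;  // two doubles per point
}
```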


Here is my original (outdated) answer: @Frederik already explained that the performance mismatch is (at least partially) tied to a difference in functionality.

As to how to get the performance back: on my machine with VS2015, the following runs in about 2/3 of the time the C solution requires (although, on my machine, there is "only" a 3x performance gap between your original versions to begin with):

istream &operator >> (istream &in, point &p) {
    thread_local std::stringstream ss;
    thread_local std::string s;

    if (std::getline(in, s)) {
        ss.clear();  // reset the eofbit left over from the previous line
        ss.str(s);
        ss >> p.x >> p.y;
    }
    return in;
}

I'm not too happy about the thread_local variables, but they are necessary to eliminate the overhead of repeated dynamic memory allocation.

MikeMB
  • I appreciate that you're trying to help here (and I especially appreciate your help in the comments above) but this feels to me like piling hack upon hack. I'm surprised to hear that the gap is so narrow on Windows though. Upvoting because you are actually trying to answer my question though. – Daniel Lyons Jun 19 '16 at 06:02
  • I like this answer the most, seems to tackle the issue more directly. I'd like to suggest "disabling" caching on the C program by using open/read/close instead of fopen/fgets/fclose and see how that goes. – Spidey Mar 30 '20 at 11:45
1

As noted in the comments, make sure the actual algorithm for reading input is as good in C++ as in C. And make sure you call std::ios::sync_with_stdio(false) so the iostreams are not slowed down by syncing with C stdio.

But in my experience, C stdio is faster than C++ iostreams, though the C library is neither type-safe nor extensible.

Erik Alapää
  • [`std::ios::sync_with_stdio`](http://en.cppreference.com/w/cpp/io/ios_base/sync_with_stdio) only affects the standard input/output streams (like `std::cout`/`stdout`). – MikeMB Jun 18 '16 at 10:12
  • Yes, true. Knowing about sync_with_stdio is good, but if the program will never be modified to read from stdin, the syncing is not necessary to consider. – Erik Alapää Jun 18 '16 at 11:08
  • This is not an answer to the question I actually asked. – Daniel Lyons Jun 19 '16 at 06:00
  • Yes it is. Use a comparably good algorithm as in C, and even then, you should probably expect old, simple C stdio to be faster. – Erik Alapää Jun 19 '16 at 08:21