
I know a lot of a compiler's optimizations can be rather esoteric, but my example is so straightforward that I'd like to see if I can understand it, if anyone has an idea what the compiler could be doing.

I have a 500 MB text file. I declare and initialize an fstream:

std::fstream file(path, std::ios::in);

I need to read the file sequentially. It's tab-delimited, but the field lengths are not known and vary from line to line. The actual parsing I need to do on each line added very little time to the total (which really surprised me, since I was doing string::find on each line from getline; I thought that'd be slow).

In general I want to search each line for a string and abort the loop when I find it. I also have it incrementing and spitting out the line numbers for my own curiosity; I confirmed this adds little time (5 seconds or so) and lets me see how it blows past short lines and slows down on long lines.

I have the text to be found as the unique string tagging the end of the file, so it needs to search every line. I'm doing this on my phone, so I apologize for formatting issues, but it's pretty straightforward. I have a function taking my fstream as a reference and the text to be found as a string, and returning a std::size_t.

long long int lineNum = 0;
std::string line;
std::size_t pos;
while (std::getline(file, line))
{
    pos = line.find(text);
    lineNum += 1;
    std::cout << std::to_string(lineNum) << std::endl;
    if (pos != std::string::npos)
        return static_cast<std::size_t>(file.tellg());
}
return std::string::npos;

Edit: Lingxi pointed out that the to_string isn't necessary here; thanks. As mentioned, entirely omitting the line number calculation and output saves a few seconds, which in my preoptimized example is a small percent of the total.

This successfully runs through every line and returns the end position in 408 seconds. I saw minimal improvement from putting the file in a stringstream, or from omitting everything in the loop (just getline until the end; no checks, searches, or displaying). Pre-reserving a huge space for the string didn't help either.

Seems like the getline is entirely the driver. However...if I compile with the /O2 flag (MSVC++) I get a comically faster 26 seconds. In addition, there is no apparent slow down on long lines vs short. Clearly the compiler is doing something very different. No complaints from me, but any thoughts as to how it's achieved? As an exercise I'd like to try and get my code to execute faster before compiler optimization.

I bet it has something to do with the way getline manipulates the string. Would it be faster (alas, can't test for a while) to just reserve the whole file size for the string, read character by character, and increment my line number when I pass a \n? Also, would the compiler employ things like mmap?

UPDATE: I'll post code when I get home this evening. It looks like just turning off runtime checks dropped the execution from 400 seconds to 50! I tried performing the same functionality using raw c style arrays. I'm not super experienced, but it was easy enough to dump the data into a character array, and loop through it looking for newlines or the first letter of my target string.

Even in full debug mode it gets to the end and correctly finds the string in 54 seconds. 26 seconds with checks off, and 20 seconds optimized. So from my informal, ad-hoc experiments it looks like the string and stream functions are victimized by the runtime checks? Again, I'll double check when I get home.

mock_blatt
  • You could just `std::cout << lineNum << std::endl;`, which could save you some time. – Lingxi Apr 17 '15 at 15:25
  • This must be the first time I see someone call iostreams "fast". Possibilities include omitting debug checks (MSVC's standard library has a lot of those), better inlining, devirtualization. – T.C. Apr 17 '15 at 15:42
  • ... and optimization of *your* code. For example, the assignment to `pos` (indeed the variable itself) is worthless in this code, and I would bet the optimizer will toss it completely. I wouldn't be surprised if the entire function were inlined; it isn't that complicated. The file IO isn't the only thing that can change when this is thrown at an optimizer. – WhozCraig Apr 17 '15 at 15:48
  • "entirely omitting the line number calculation and output saves a few seconds" - just an FYI - you're defeating any chance of multi-line buffering there - even if the output is sent to a file or device (`/dev/nul`?) - by using `std::endl` rather than `\n`. They'll end up flushing the same on many interactive terminals though, as they tend to be line buffered anyway. – Tony Delroy Apr 17 '15 at 15:49
  • You might improve performance further by [disabling stdio sync](http://stackoverflow.com/a/6821165/10077). – Fred Larson Apr 17 '15 at 15:51
  • Lots of speculation here. Why not run with a sampling profiler and find out for certain? (P.S. Every time I run my code under a profiler for the first time, I am surprised. Every time. Makes sense when you think about it... If I knew where the time was going, I would have optimized that part in the first place.) (P.P.S. mmap() would almost certainly be slower.) – Nemo Apr 17 '15 at 16:11
  • Thanks guys. I was hung up on the huge time savings, but it's about 380 seconds over about 100,000 lines, so just 4 thousandths of a second per line. Small inefficiencies in the loop can easily add up. I'll do some experimenting. – mock_blatt Apr 17 '15 at 16:12
  • @Nemo I know this is off topic for this question, but I'm curious why you believe `mmap` would be slower. – Mark Ransom Apr 17 '15 at 16:25
  • I believe that the *compiler* has nothing to do with optimizing `std::getline`. The function is a library routine and already optimized, usually by the compiler writers or a platform guru. What you are observing is the performance qualities of the `getline` which usually don't have anything to do with the compiler. – Thomas Matthews Apr 17 '15 at 16:32
  • @MarkRansom: See http://stackoverflow.com/a/6657136/768469. Also try a search for "mmap" and "slow". Torvalds's old quote comports with my experience, at least on Linux; I have never seen `mmap` work faster than `read`/`write` with modest-sized blocks, and I typically work with files in the 500+ gigabyte range. (Although I have not tried benchmarking `mmap` lately... It is possible that "transparent hugepages" would mitigate the bad effects, although last time I looked THP was only applied to anonymous pages.) – Nemo Apr 17 '15 at 17:30
  • @Nemo thanks. I always assumed that `mmap` was just a thin layer over the demand paging used for swapping, so it would be optimized to the utmost extent. It never occurred to me that it would have so much overhead per block! – Mark Ransom Apr 17 '15 at 20:38
  • @Mark: The problem is the VM management. The kernel deals directly with physical pages, so it can manage its memory (e.g. the page cache) very efficiently. But when it has to dork with a userspace process's page tables, you get all sorts of locking, TLB flushes, inter-processor interrupts (http://en.wikipedia.org/wiki/Inter-processor_interrupt), etc. This can easily add up to be more expensive than just copying into, say, a 12K buffer over and over, especially if that buffer is in L1 cache. Plus the mmap overheads hurt more when the system is running many processes and threads. – Nemo Apr 17 '15 at 21:26
  • @MarkRansom: I added links to more Torvalds commentary in [my other answer](http://stackoverflow.com/a/6657136/768469). – Nemo Apr 18 '15 at 00:31

1 Answer


The reason for this dramatic speedup is that the iostream class hierarchy is based on templates (std::ostream is actually a typedef for std::basic_ostream&lt;char&gt;), and a lot of its code is in headers. Without optimization, C++ iostreams take several function calls to process every byte in the stream. However, most of those functions are fairly trivial. By turning on optimization, most of these calls are inlined, exposing to the compiler the fact that std::getline essentially copies characters from one buffer to another until it finds a newline - normally this is "hidden" under several layers of function calls. This can then be optimized further, reducing the overhead per byte by orders of magnitude.

Buffering behavior actually doesn't change between the optimized and non-optimized version, otherwise the speedup would be even higher.

Krzysztof Kosiński
  • An LTO compiler such as MSVC can also look at the whole program and change virtual function calls into direct calls. That helps iostreams a lot too. – Zan Lynx Apr 17 '15 at 18:51
  • "reason for this dramatic speedup" - what dramatic speedup? The Q describes the speedup from optimisation as "comical". Optimisation is indeed important to the performance of iostreams *if the I/O can be performed fast enough to stress the code*, but it's because the actual disk reads are done in the same-sized buffered chunks that the observed performance is relatively consistent for the OP. – Tony Delroy Apr 17 '15 at 23:52
  • If you read OP's numbers, using /O2 reduces the running time by a factor of 20. It's fairly clear from the rest of the post that OP did not expect such a large speedup, and "comically faster" is actually his weird way of saying "dramatically faster". – Krzysztof Kosiński Apr 19 '15 at 23:27
  • @KrzysztofKosiński: ahhh - I took the "I get a comically faster 26 seconds" to be a poorly worded claim that it was 26 seconds faster... see where you've been coming from at last! :-). Cheers. – Tony Delroy Apr 20 '15 at 01:53