Inspired by my previous question

A common mistake for new C++ programmers is to read from a file with something along the lines of:

std::ifstream file("foo.txt");
std::string line;
while (!file.eof()) {
  file >> line;
  // Do something with line
}

They will often report that the last line of the file was read twice. The common explanation for this problem (one that I have given before) goes something like:

The extraction will only set the EOF bit on the stream if you attempt to extract the end-of-file, not if your extraction just stops at the end-of-file. file.eof() will only tell you if the previous read hit the end-of-file and not if the next one will. After the last line has been extracted, the EOF bit is still not set and the iteration occurs one more time. However, on this last iteration, the extraction fails and line still has the same content as before, i.e. the last line is duplicated.

However, the first sentence of this explanation is wrong and so the explanation of what the code is doing is also wrong.

The definition of formatted input functions (which operator>>(std::string&) is) defines extraction as using rdbuf()->sbumpc() or rdbuf()->sgetc() to obtain input characters. It states that if either of these functions returns traits::eof(), then the EOF bit is set:

If rdbuf()->sbumpc() or rdbuf()->sgetc() returns traits::eof(), then the input function, except as explicitly noted otherwise, completes its actions and does setstate(eofbit), which may throw ios_base::failure (27.5.5.4), before returning.

We can see this with the simple example that uses a std::stringstream rather than a file (they are both input streams and behave the same way when extracting):

#include <iostream>
#include <sstream>
#include <string>

int main()
{
  std::stringstream ss("hello");
  std::string result;
  ss >> result;                       // Reads "hello" and runs into the end of the stream
  std::cout << ss.eof() << std::endl; // Outputs 1
  return 0;
}

It's clear here that the single extraction obtains hello from the string and sets the EOF bit to 1.

So what's wrong with the explanation? What's different about files that causes !file.eof() to cause the last line to be duplicated? What's the real reason we shouldn't use !file.eof() as our extraction condition?

Joseph Mansfield
  • A common mistake for new C++ programmers who read bad textbooks. – Cubbi Jan 30 '13 at 23:28
  • A common mistake is not checking every single stream operation: `if(!(stream>>var)) { doErrorHandling(); }` – CoffeDeveloper Oct 17 '15 at 13:51
  • @GameDeveloper That's excessive. If I'm reading into five variables in quick succession, and I only care whether they _all_ succeeded, I only need to check `stream` at the end. _Five_ separate checks there just makes a mess. – Lightness Races in Orbit Aug 03 '19 at 17:38

2 Answers

20

Yes, extracting from an input stream will set the EOF bit if the extraction stops at the end-of-file, as demonstrated by the std::stringstream example. If it were this simple, the loop with !file.eof() as its condition would work just fine on a file like:

hello
world

The second extraction would eat world, stopping at the end-of-file, and consequently setting the EOF bit. The next iteration wouldn't occur.

However, many text editors have a dirty secret. They're lying to you when you save a text file even as simple as that. What they don't tell you is that there's a hidden \n at the end of the file. Every line in the file ends with a \n, including the last one. So the file actually contains:

hello\nworld\n

This is what causes the last line to be duplicated when using !file.eof() as the condition. Now that we know this, we can see that the second extraction will eat world stopping at \n and not setting the EOF bit (because we haven't gotten there yet). The loop will iterate for a third time but the next extraction will fail because it doesn't find a string to extract, only whitespace. The string is left with its previous value still hanging around and so we get the duplicated line.
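The effect is easy to reproduce without a file. The following is a minimal sketch, assuming a std::istringstream stands in for the file; the helper names read_with_eof_loop and check_eof_loop are made up for illustration. It runs the flawed loop over a string and records what line holds after each iteration.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: runs the flawed !eof() loop over `contents`
// and records the value of `line` after every iteration.
std::vector<std::string> read_with_eof_loop(const std::string& contents) {
    std::istringstream in(contents);
    std::vector<std::string> seen;
    std::string line;
    while (!in.eof()) {
        in >> line;           // result deliberately left unchecked, as in the flawed loop
        seen.push_back(line);
    }
    return seen;
}

// Executable statement of the behaviour described above.
void check_eof_loop() {
    using V = std::vector<std::string>;
    // No trailing newline: the second extraction runs into end-of-file, so the loop stops.
    assert((read_with_eof_loop("hello\nworld") == V{"hello", "world"}));
    // Trailing newline: a third, failing extraction leaves "world" in place.
    assert((read_with_eof_loop("hello\nworld\n") == V{"hello", "world", "world"}));
}
```

The only difference between the two inputs is the final \n, which is enough to produce the extra iteration.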

You don't experience this with std::stringstream because what you stick in the stream is exactly what you get. There's no \n at the end of std::stringstream ss("hello"), unlike in the file. If you were to do std::stringstream ss("hello\n"), you'd experience the same duplicate line issue.

So of course, we can see that we should never use !file.eof() as the condition when extracting from a text file - but what's the real issue here? Why should we really never use that as our condition, regardless of whether we're extracting from a file or not?

The real problem is that eof() gives us no idea whether the next read will fail or not. In the above case, we saw that even though eof() was 0, the next extraction failed because there was no string to extract. The same situation would happen if we didn't associate a file stream with any file or if the stream was empty. The EOF bit wouldn't be set but there's nothing to read. We can't just blindly go ahead and extract from the file just because eof() isn't set.
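A small sketch of those two cases, assuming an empty std::istringstream for "the stream was empty" and a default-constructed std::ifstream for "not associated with any file". The Probe struct and the probe function are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical record of one extraction attempt.
struct Probe {
    bool eof_before;  // eof() just before the extraction
    bool extracted;   // whether `in >> word` succeeded
};

// Checks eof() first, then tries to extract a single word.
Probe probe(std::istream& in) {
    Probe p;
    p.eof_before = in.eof();
    std::string word;
    p.extracted = static_cast<bool>(in >> word);
    return p;
}

void check_probe() {
    std::istringstream empty("");  // a stream with no content
    Probe a = probe(empty);
    assert(!a.eof_before && !a.extracted);

    std::ifstream unopened;        // not associated with any file
    Probe b = probe(unopened);
    assert(!b.eof_before && !b.extracted);
}
```

In both cases eof() reports false and the extraction still fails, so eof() on its own says nothing about whether the next read will succeed.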

Using while (std::getline(...)) and related conditions works because the extraction itself is the loop condition. Just before the extraction starts, the input function checks whether any of the bad, fail, or EOF bits are set. If any of them are, it returns immediately, setting the fail bit in the process. It also fails if it reaches the end-of-file before it finds what it wants to extract, setting both the EOF and fail bits. In either case the stream converts to false, so the loop body never runs after a failed read.
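For comparison, here is a minimal sketch of the loop with the extraction as its condition. The helper names read_lines and check_read_lines are hypothetical, and a std::istringstream again stands in for the file.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects lines, using the read itself as the loop condition.
std::vector<std::string> read_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {  // the body runs only after a successful read
        lines.push_back(line);
    }
    return lines;
}

void check_read_lines() {
    using V = std::vector<std::string>;
    std::istringstream with_newline("hello\nworld\n");
    std::istringstream without_newline("hello\nworld");
    // The result is the same whether or not the input ends with a newline.
    assert((read_lines(with_newline) == V{"hello", "world"}));
    assert((read_lines(without_newline) == V{"hello", "world"}));
}
```

The same shape works for formatted extraction: while (in >> line) stops as soon as an extraction fails.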


Note: You can save a file without the extra \n in vim if you do :set noeol and :set binary before saving.

Joseph Mansfield
  • 1
    A decent editor will not add a newline unless you tell it to. – Daniel Fischer Jan 30 '13 at 23:31
  • 6
    @DanielFischer, there are just as many bugs triggered by leaving the newline off the last line of a file as there are bugs triggered by having it there. The proper solution is to write programs that work either way. – Mark Ransom Jan 30 '13 at 23:35
  • 2
    A new line is required at the end of a file being read in text mode. – Pete Becker Jan 30 '13 at 23:37
  • 3
    @PeteBecker, do you have a reference to back that up? Since it is difficult to see visually that the last line is EOL terminated, such a rule would be unnecessarily harsh - you're just asking for bugs. – Mark Ransom Jan 30 '13 at 23:42
  • Related [Why should files end with a newline?](http://stackoverflow.com/questions/729692/why-should-files-end-with-a-newline). So I guess there is some truth to the fact that you should add a newline. However, I do not know of an editor that adds it for you. I think `What they don't tell you is that there's a hidden \n at the end of the file. ` is a false statement and a newline is only added if you add it yourself. – Jesse Good Jan 31 '13 at 00:37
  • 2
    @MarkRansom - it's ancient C, for compatibility with mainframes that have to force streams on top of record-oriented I/O. – Pete Becker Jan 31 '13 at 02:22
  • @JesseGood - Sublime Text puts newlines at the end of files. Seems to me that emacs does, too, or can be configured to do it, but I don't use emacs, so that's just an impression. – Pete Becker Jan 31 '13 at 02:23
  • 1
    @PeteBecker: Yeah, Vim also does it as mentioned in the answer, and this behavior seems more leftover from days when programs couldn't process the file correctly. At least the C++11 standard was smart enough to get rid of the rule requiring the newline (§2.2p1.2). – Jesse Good Jan 31 '13 at 02:50
4

Your question has some bogus conceptions. You give an explanation:

"The extraction will only set the EOF bit on the stream if you attempt to extract the end-of-file, not if your extraction just stops at the end-of-file."

Then claim it "is wrong and so the explanation of what the code is doing is also wrong."

Actually, it's right. Let's look at an example....

When reading into a std::string...

std::istringstream iss("abc\n");
std::string my_string;
iss >> my_string;

...by default, and as in your question, operator>> reads characters until it finds whitespace or EOF. So:

  • reading from 'abc\n' -> once the '\n' is encountered it doesn't "attempt to extract the end-of-file", rather it "just stops at [EOF]", and eof() won't return true,
  • reading from 'abc' instead -> it's the attempt to extract the end-of-file that discovers the end of the string content, so eof() will return true.

Similarly, parsing '123' into an int sets eof() because the parsing doesn't know if there will be another digit and tries to keep reading them, hitting eof(). Parsing '123 ' to an int won't set eof().

Crucially, parsing 'a' into a char won't set eof() because trailing whitespace isn't needed to know that the parsing is complete - once a character is read no attempt is made to find another character and the eof() isn't encountered. (Of course further parsing from the same stream hits eof).
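These cases can be checked directly. The following is a sketch, assuming std::istringstream input; eof_after_extracting and check_eof_after_extracting are hypothetical helpers written for this illustration.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: extracts one T from `text` and reports
// whether the EOF bit is set afterwards.
template <class T>
bool eof_after_extracting(const std::string& text) {
    std::istringstream in(text);
    T value{};
    in >> value;
    return in.eof();
}

void check_eof_after_extracting() {
    assert(eof_after_extracting<std::string>("abc"));     // keeps reading, runs into EOF
    assert(!eof_after_extracting<std::string>("abc\n"));  // stops at the '\n'
    assert(eof_after_extracting<int>("123"));             // looks for another digit, finds EOF
    assert(!eof_after_extracting<int>("123 "));           // stops at the space
    assert(!eof_after_extracting<char>("a"));             // one character read, no further attempt
}
```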

"It's clear [for stringstream "hello" >> std::string] that the single extraction obtains hello from the string and sets the EOF bit to 1. So what's wrong with the explanation? What's different about files that causes !file.eof() to cause the last line to be duplicated? What's the real reason we shouldn't use !file.eof() as our extraction condition?"

The reason is as above: files tend to be terminated by a '\n' character, and when they are, getline or >> std::string returns the last non-whitespace token without needing to "attempt to extract the end-of-file" (to use your phrase).

Tony Delroy