What are the guidelines regarding parsing with iostreams?

Question

I found myself writing a lot of parsing code lately (mostly custom formats, but it isn't really relevant).

To enhance reusability, I chose to base my parsing functions on i/o streams so that I can use them with things like boost::lexical_cast<>.

I however realized I have never read anywhere anything about how to do that properly.

To illustrate my question, let's consider I have three classes Foo, Bar and FooBar:

A Foo is represented by data in the following format: string(<number>, <number>).

A Bar is represented by data in the following format: string[<number>].

A FooBar is kind-of a variant type that can hold either a Foo or a Bar.

Now let's say I wrote an operator>>() for my Foo type:

istream& operator>>(istream& is, Foo& foo)
{
    char c1, c2, c3;
    is >> foo.m_string >> c1 >> foo.m_x >> c2 >> std::ws >> foo.m_y >> c3;

    if ((c1 != '(') || (c2 != ',') || (c3 != ')'))
    {
      is.setstate(std::ios_base::failbit);
    }

    return is;
}

The parsing goes fine for valid data. But if the data is invalid:

foo might be partially modified;
Some data in the input stream was read and is thus no longer available to further calls to is.

Also, I wrote another operator>>() for my FooBar type:

istream& operator>>(istream& is, FooBar& foobar)
{
  Foo foo;

  if (is >> foo)
  {
    foobar = foo;
  }
  else
  {
    is.clear();

    Bar bar;

    if (is >> bar)
    {
      foobar = bar;
    }
  }

  return is; 
}

But obviously it doesn't work because if is >> foo fails, some data has already been read and is no longer available for the call to is >> bar.

So here are my questions:

Where is my mistake here ?
Should one write his calls to operator>> to leave the initial data still available after a failure ? If so, how can I do that efficiently ?
If not, is there a way to "store" (and restore) the complete status of an input stream: state and data ?
What differences are they between failbit and badbit ? When should we use one or the other ?
Is there any online reference (or a book) that explains deeply how to deal with iostreams ? not just the basic stuff: the complete error handling.

Thank you very much.

@KerrekSB: Is Boost.Qi the same thing than Boost.Spirit ? If so, I already gave it a try. While I was astonished by the design of the library, it significantly increased my compilation times to the point of making my compiler crash under some conditions. Anyway, my question is more about learning the good practices than having a practical solution. Thank you though. — ereOn, Jan 11 '12 at 16:36
Yes, I think it's the same, though I really don't see through the jungle of spirit/karma/qi/phoenix/fusion... and you definitely want precompiled headers with Boost, that much is certain! Qi just struck me as a rather elegant solution to something that needs lots of hand-written wheel reinvention otherwise. — Kerrek SB, Jan 11 '12 at 16:43
Yes, Qi is the parsing sublibrary of Boost.Spirit (Karma is the generator sublibrary). In any case, agreed that precompiled headers are an absolute must. — ildjarn, Jan 11 '12 at 17:11
["Semantics of flags on basic_ios"](http://stackoverflow.com/questions/4258887/semantics-of-flags-on-basic-ios) covers the semantics of `failbit` and `badbit`. — James McNellis, Jan 11 '12 at 17:52
@ereOn: I didn't use pre-compilers headers with Qi, but I definitely suggest creating small functions in individual files to isolate the need for recompilation. — Matthieu M., Jan 11 '12 at 17:59
@MatthieuM.: I never thought of using precompiled headers for Boost but it makes perfect sense. I indeed had to create small functions in separate translation units to avoid the compilation issues. — ereOn, Jan 12 '12 at 08:19

score 4 · Accepted Answer · answered Jan 12 '12 at 00:17

Personally, I think these are reasonable questions and I remember very well that I struggled with them myself. So here we go:

Where is my mistake here ?

I wouldn't call it a mistake but you probably want to make sure you don't have to back off from what you have read. That is, I would implement three versions of the input functions. Depending on how complex the decoding of a specific type is I might not even share the code because it might be just a small piece anyway. If it is more than a line or two I probably would share the code. That is, in your example I would have an extractor for FooBar which essentially reads the Foo or the Bar members and initializes objects correspondingly. Alternatively, I would read the leading part and then call a shared implementation extracting the common data.

Let's do this exercise because there are a few things which may be a complication. From your description of the format it isn't clear to me if the "string" and what follows the string are delimited e.g. by a whitespace (space, tab, etc.). If not, you can't just read a std::string: the default behavior for them is to read until the next whitespace. There are ways to tweak the stream into considering characters as whitespace (using std::ctype<char>) but I'll just assume that there is space. In this case, the extractor for Foo could look like this (note, all code is entirely untested):

std::istream& read_data(std::istream& is, Foo& foo, std::string& s) {
    Foo tmp(s);
    if (is >> get_char<'('> >> tmp.m_x >> get_char<','> >> tmp.m_y >> get_char<')'>)
        std::swap(tmp, foo);
    return is;
}
std::istream& operator>>(std::istream& is, Foo& foo)
{
    std::string s;
    return read_data(is >> s, foo, s);
}

The idea is that read_data() reads the part of a Foo which is different from Bar when reading a FooBar. A similar approach would be used for Bar but I omit this. The more interesting bit is the use of this funny get_char() function template. This is something called a manipulator and is just a function taking a stream reference as argument and returning a stream reference. Since we have different characters we want to read and compare against, I made it a template but you can have one function per character as well. I'm just too lazy to type it out:

template <char Expect>
std::istream& get_char(std::istream& in) {
    char c;
    // fail the stream if the extracted character is not the expected one
    if (in >> c && c != Expect) {
        in.setstate(std::ios_base::failbit);
    }
    return in;
}

What looks a bit weird about my code is that there are few checks if things worked. That is because the stream would just set std::ios_base::failbit when reading a member failed and I don't really have to bother myself. The only case where there is actually special logic added is in get_char() to deal with expecting a specific character. Similarly there is no skipping of whitespace characters (i.e. use of std::ws) going on: all the input functions are formatted input functions and these skip whitespace by default (you can turn this off by using e.g. in >> std::noskipws) but then lots of things won't work.
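To make the pieces above concrete, here is a minimal self-contained sketch of the manipulator together with a Foo extractor that parses into a temporary and commits only on success. The Foo struct and the parse_foo() helper are hypothetical stand-ins (the question does not show the real class), and the input is assumed to have whitespace between the string and the opening parenthesis.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical stand-in for the question's Foo; the real class is not shown.
struct Foo {
    std::string m_string;
    int m_x = 0;
    int m_y = 0;
};

// Manipulator: extracts one non-whitespace character and sets failbit
// unless it equals Expect.
template <char Expect>
std::istream& get_char(std::istream& in) {
    char c;
    if (in >> c && c != Expect) {
        in.setstate(std::ios_base::failbit);
    }
    return in;
}

// Parses "name (x, y)" into a temporary; foo is only touched on success.
std::istream& operator>>(std::istream& is, Foo& foo) {
    Foo tmp;
    if (is >> tmp.m_string >> get_char<'('> >> tmp.m_x >> get_char<','>
           >> tmp.m_y >> get_char<')'>) {
        std::swap(tmp, foo);
    }
    return is;
}

// Hypothetical convenience wrapper: runs the extractor on a string.
bool parse_foo(const std::string& text, Foo& out) {
    std::istringstream in(text);
    return static_cast<bool>(in >> out);
}
```

With this sketch, parse_foo("foo (1, 2)", f) fills f, while a malformed input such as "bar (3; 4)" fails and leaves f as it was.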

With a similar implementation for reading a Bar, reading a FooBar would look something like this:

std::istream& operator>> (std::istream& in, FooBar& foobar) {
    std::string s;
    if (in >> s) {
        // skip whitespace, then look at the next character without consuming it
        switch ((in >> std::ws).peek()) {
        case '(': { Foo foo; if (read_data(in, foo, s)) foobar = foo; break; }
        case '[': { Bar bar; if (read_data(in, bar, s)) foobar = bar; break; }
        default: in.setstate(std::ios_base::failbit);
        }
    }
    return in;
}

This code uses an unformatted input function, peek(), which just looks at the next character. It either returns the next character or it returns std::char_traits<char>::eof() if it fails. So, if there is either an opening parenthesis or an opening bracket we have read_data() take over. Otherwise we always fail. That solves the immediate problem. On to distributing information...
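The peek-based dispatch can also be tried in isolation. The following is only a sketch: it uses a hypothetical flattened FooBar struct with a kind field instead of real Foo and Bar classes, and a plain helper function expect_char() in place of the get_char template, so that the block compiles on its own.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical minimal variant; the question's FooBar is not shown.
struct FooBar {
    enum Kind { None, IsFoo, IsBar };
    Kind kind = None;
    std::string name;
    int x = 0;  // used by both forms
    int y = 0;  // used by the Foo form only
};

// Extracts one non-whitespace character and fails the stream on a mismatch.
std::istream& expect_char(std::istream& in, char expected) {
    char c;
    if (in >> c && c != expected) {
        in.setstate(std::ios_base::failbit);
    }
    return in;
}

// Reads "name (x, y)" or "name [x]", deciding by the first character
// after the name. The result is committed only if everything parsed.
std::istream& operator>>(std::istream& in, FooBar& out) {
    FooBar tmp;
    if (in >> tmp.name) {
        switch ((in >> std::ws).peek()) {
        case '(':
            tmp.kind = FooBar::IsFoo;
            expect_char(in, '(') >> tmp.x;
            expect_char(in, ',') >> tmp.y;
            expect_char(in, ')');
            break;
        case '[':
            tmp.kind = FooBar::IsBar;
            expect_char(in, '[') >> tmp.x;
            expect_char(in, ']');
            break;
        default:
            in.setstate(std::ios_base::failbit);
        }
    }
    if (in) {
        out = tmp;
    }
    return in;
}

// Hypothetical convenience wrapper: runs the extractor on a string.
bool parse_foobar(const std::string& text, FooBar& out) {
    std::istringstream in(text);
    return static_cast<bool>(in >> out);
}
```

No character ever has to be put back here: the decision is made before the distinguishing character is consumed.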

Should one write his calls to operator>> to leave the initial data still available after a failure ?

The general answer is: no. If you failed to read, something went wrong and you give up. This might mean that you need to work harder to avoid failing, though. If you really need to back off from the position you were at to parse your data, you might want to read data first into a std::string using std::getline() and then analyze this string. Use of std::getline() assumes that there is a distinct character to stop at. The default is newline (hence the name) but you can use other characters as well:

std::getline(in, str, '!');

This would stop at the next exclamation mark and store all characters up to it in str. It would also extract the termination character but it wouldn't store it. This makes it interesting sometimes when you read the last line of a file which may not have a newline: std::getline() succeeds if it can read at least one character. If you need to know if the last character in a file is a newline, you can test if the stream reached the end of the file:

if (std::getline(in, str) && in.eof()) { std::cout << "file not ending in newline\n"; }
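Both uses of std::getline() described above can be sketched as small functions. The names split_on() and last_line_unterminated() are made up for this illustration.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Splits the remaining input at each occurrence of delim. The delimiter
// is extracted from the stream but not stored in the pieces.
std::vector<std::string> split_on(std::istream& in, char delim) {
    std::vector<std::string> parts;
    std::string part;
    while (std::getline(in, part, delim)) {
        parts.push_back(part);
    }
    return parts;
}

// Reads all lines and reports whether the last one ended at end-of-file
// instead of at a newline.
bool last_line_unterminated(std::istream& in) {
    std::string line;
    bool unterminated = false;
    while (std::getline(in, line)) {
        // eofbit is set only if getline() hit the end before a newline
        unterminated = in.eof();
    }
    return unterminated;
}
```

For the input "x\ny" the second function returns true, for "x\ny\n" it returns false.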

If so, how can I do that efficiently ?

Streams are by their very nature single pass: you receive each character just once and if you skip over one you consume it. Thus, you typically want to structure your data in a way such that you don't have to backtrack. That said, this isn't always possible and most streams actually have a buffer under the hood to which characters can be returned. Since streams can be implemented by a user there is no guarantee that characters can be returned. Even for the standard streams there isn't really a guarantee.

If you want to return a character, you have to put back exactly the character you extracted:

char c;
if (in >> c && c != 'a')
    in.putback(c);
if (in >> c && c != 'b')
    in.unget();

The latter function has slightly better performance because it doesn't have to check that the character is indeed the one which was extracted. It also has fewer chances to fail. Theoretically, you can put back as many characters as you want but most streams won't support more than a few in all cases: if there is a buffer, the standard library takes care of "ungetting" all characters until the start of the buffer is reached. If another character is returned, it calls the virtual function std::streambuf::pbackfail() which may or may not make more buffer space available. In the stream buffers I have implemented it will typically just fail, i.e. I typically don't override this function.
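A typical use of returning a single character is an "optional token" helper. The function name consume_if() is made up for this sketch, and it relies on the stream being able to take back one character, which holds for a std::istringstream but is not guaranteed for every stream.

```cpp
#include <istream>
#include <sstream>

// Extracts the next non-whitespace character. If it is not the wanted
// one, the character is returned to the stream and false is reported.
bool consume_if(std::istream& in, char wanted) {
    char c;
    if (!(in >> c)) {
        return false;  // nothing to read; the stream is now in a failed state
    }
    if (c == wanted) {
        return true;
    }
    in.unget();  // in.putback(c) would have the same effect here
    return false;
}
```

On the input "[42]", consume_if(in, '(') returns false and leaves the bracket in place, so a following consume_if(in, '[') succeeds and the number can be read afterwards.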

If not, is there a way to "store" (and restore) the complete status of an input stream: state and data ?

If you mean to entirely restore the state you were at, including the characters, the answer is: sure there is. ...but no easy way. For example, you could implement a filtering stream buffer and put back characters as described above to restore the sequence to be read (or support seeking or explicitly setting a mark in the stream). For some streams you can use seeking but not all streams support this. For example, std::cin typically doesn't support seeking.
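For streams which do support seeking, restoring the read position can be sketched as follows. The helper try_extract() is hypothetical. It assumes a seekable stream such as a std::istringstream and simply gives up when the position cannot be determined.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Tries to extract a T. On failure the error flags are cleared and the
// read position is moved back to where it was before the attempt.
template <typename T>
bool try_extract(std::istream& in, T& value) {
    const std::istream::pos_type start = in.tellg();
    if (start == std::istream::pos_type(-1)) {
        return false;  // position unknown: the stream is not seekable
    }
    T tmp;
    if (in >> tmp) {
        value = tmp;
        return true;
    }
    in.clear();       // must come first: seekg() does nothing on a failed stream
    in.seekg(start);  // rewind to the saved position
    return false;
}
```

The order of clear() and seekg() matters: seeking is skipped while failbit is set.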

Restoring the characters is only half the story, though. The other stuff you want to restore are the state flags and any formatting data. In fact, if the stream went into a failed or even bad state you need to clear the state flags before the stream will do most operations (although I think the formatting stuff can be reset anyway):

std::istream fmt(0); // doesn't have a default constructor: create an invalid stream
fmt.copyfmt(in);     // save the current format settings
// use in
in.copyfmt(fmt);     // restore the original format settings

The function copyfmt() copies all fields associated with the stream which are related to formatting. These are:

the locale
the fmtflags
the information storage iword() and pword()
the stream's events
the exceptions
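the fill character and the tied stream (tie()); note that the stream's state flags (rdstate()) and its stream buffer (rdbuf()) are not copied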

If you don't know about most of them don't worry: most stuff you probably won't care about. Well, until you need it but by then you have hopefully acquired some documentation and read about it (or asked and got a good response).
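The save and restore steps can be wrapped in a small scope guard so the restore cannot be forgotten. The class format_guard and the function read_hex_then_dec() are names invented for this sketch. It assumes the stream uses the default exceptions() mask: copyfmt() also copies that mask, and copying a mask with badbit enabled into the buffer-less holder would throw.

```cpp
#include <ios>
#include <istream>
#include <sstream>
#include <utility>

// Saves the formatting settings of a stream on construction and
// restores them on destruction.
// Assumption: the guarded stream has the default exceptions() mask.
class format_guard {
public:
    explicit format_guard(std::ios& stream)
        : stream_(stream), saved_(nullptr) {  // holder without a stream buffer
        saved_.copyfmt(stream_);
    }
    ~format_guard() { stream_.copyfmt(saved_); }

    format_guard(const format_guard&) = delete;
    format_guard& operator=(const format_guard&) = delete;

private:
    std::ios& stream_;
    std::ios saved_;
};

// Reads one hexadecimal number, then one number in the stream's
// original base.
std::pair<int, int> read_hex_then_dec(std::istream& in) {
    int a = 0;
    int b = 0;
    {
        format_guard guard(in);
        in >> std::hex >> a;  // base changed only inside this scope
    }
    in >> b;  // parsed with the restored settings
    return std::make_pair(a, b);
}
```

For the input "ff 10" this yields 255 and 10: the std::hex setting does not leak out of the guarded scope.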

What differences are they between failbit and badbit ? When should we use one or the other ?

Finally a short and simple one:

failbit is set when formatting errors are detected, e.g. a number is expected but the character 'T' is found.
badbit is set when something goes wrong in the stream's infrastructure. For example, when the stream buffer isn't set (as in the stream fmt above) the stream has std::ios_base::badbit set. The other reason is if an exception is thrown (and caught by way of the exceptions() mask; by default all exceptions are caught).
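The two situations can be observed directly by looking at rdstate(). The function names below are made up for this illustration.

```cpp
#include <ios>
#include <istream>
#include <sstream>

// Formatting error: the input "T" cannot be parsed as a number.
std::ios_base::iostate state_after_bad_number() {
    std::istringstream in("T");
    int n = 0;
    in >> n;
    return in.rdstate();
}

// Infrastructure error: a stream constructed without a stream buffer.
std::ios_base::iostate state_without_buffer() {
    std::istream in(nullptr);
    return in.rdstate();
}
```

The first function returns a state with failbit set and badbit clear. The second returns exactly badbit.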

Is there any online reference (or a book) that explains deeply how to deal with iostreams ? not just the basic stuff: the complete error handling.

Ah, yes, glad you asked. You probably want to get Nicolai Josuttis's "The C++ Standard Library". I know that this book describes all the details because I contributed to writing it. If you really want to know everything about IOStreams and locales you want Angelika Langer & Klaus Kreft's "IOStreams and Locales". In case you wonder where I got the information from originally: this was Steve Teale's "IOStreams". I don't know if this book is still in print, and it lacks a lot of the stuff which was introduced during standardization. Since I implemented my own version of IOStreams (and locales) I know about the extensions as well, though.

Thank you very much for this awesome and complete answer :) Clears almost all my doubts on the topic ! I will definitely try to get one of these books. — ereOn, Jan 12 '12 at 08:44

score 2 · Answer 2 · answered Jan 11 '12 at 17:15

So here are my questions:

Q: Where is my mistake here ?

I would not call your technique a mistake. It is absolutely fine.
When you read data from a stream you normally already know the objects coming off that stream. If the objects have multiple interpretations, then that also needs to be encoded into the stream, or you need to be able to roll back the stream.

Q: Should one write his calls to operator>> to leave the initial data still available after a failure?

Failure state should be there only if something really bad went wrong.
In your case if you are expecting a foobar (that has two representations) you have a choice:

Mark the type of object that is coming in the stream with some prefix data.
In the foobar parsing section use tellg() and seekg() to restore the stream position.

Try:

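  std::streampos point = is.tellg();
  if (is >> foo)
  {
    foobar = foo;
  }
  else
  {
    is.clear();       // clear the error flags first: seekg() does nothing on a failed stream
    is.seekg(point);  // then go back to the saved position
    // ... now try is >> bar
  }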

Q: If so, how can I do that efficiently ?

I prefer method 1 where you know the type on the stream.
Method two can be used when this is unknowable.

Q: If not, is there a way to "store" (and restore) the complete status of an input stream: state and data ?

Yes but it requires two calls: see

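std::ios_base::iostate state = stream.rdstate();  // the state flags
std::istream           holder(0);                 // std::istream has no default constructor
holder.copyfmt(stream);                           // the formatting settings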

Q: What differences are they between failbit and badbit ?

From the documentation to the call fail():

failbit: is generally set by an input operation when the error was related to the internal logic of the operation itself, so other operations on the stream may be possible.
badbit: is generally set when the error involves the loss of integrity of the stream, which is likely to persist even if a different operation is performed on the stream. badbit can be checked independently by calling member function bad.

Q: When should we use one or the other ?

You should be setting failbit.
This means that your operation failed. If you know how it failed then you can reset and try again.

badbit is when you accidentally mash internal members of the stream or do something so bad that the stream object itself is completely forked.
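"Reset and try again" could look like the following sketch, which reads every integer from a stream and skips tokens that are not numbers. The function name and the skip-one-token recovery policy are assumptions made for this illustration. The recovery is only attempted for failbit, not for badbit or end of input.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Collects all integers from the stream, discarding tokens in between
// which cannot be parsed as a number.
std::vector<int> read_ints_skipping_junk(std::istream& in) {
    std::vector<int> values;
    for (;;) {
        int n;
        if (in >> n) {
            values.push_back(n);
            continue;
        }
        if (in.bad() || in.eof()) {
            break;  // broken stream or no more input: give up
        }
        in.clear();  // a plain format error: reset the flags ...
        std::string junk;
        in >> junk;  // ... and discard the offending token
    }
    return values;
}
```

For the input "1 two 3" this returns the values 1 and 3.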

Thank you very much for this very complete answer. I'm accepting *Dietmar Kühl* answer because it is also very good and provided some references to book on the topic. Upvoting this one for fairness though. — ereOn, Jan 12 '12 at 08:46

score 0 · Answer 3 · answered Jan 11 '12 at 16:38

When you serialize your FooBar you should have a flag indicating which one it is, which will be the "header" for your write/read.

When you read it back, you read the flag then read in the appropriate datatype.

And yes, it is safest to read first into a temporary object then move the data. You can sometimes optimise this with a swap() function.
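A tagged format along these lines might look like the sketch below. The one-character tags 'F' and 'B', the flattened FooBar struct and the round_trip() helper are all assumptions made for the example, not something taken from the question.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical flattened variant: tag 'F' uses x and y, tag 'B' only x.
struct FooBar {
    char tag = 'B';
    std::string name;
    int x = 0;
    int y = 0;
};

// Writes the tag first so that the reader knows what follows.
std::ostream& operator<<(std::ostream& os, const FooBar& v) {
    os << v.tag << ' ' << v.name << ' ' << v.x;
    if (v.tag == 'F') {
        os << ' ' << v.y;
    }
    return os;
}

// Reads the tag, then the fields which belong to it. The data goes into
// a temporary first and is swapped in only on success.
std::istream& operator>>(std::istream& is, FooBar& v) {
    FooBar tmp;
    if (is >> tmp.tag >> tmp.name >> tmp.x) {
        if (tmp.tag == 'F') {
            is >> tmp.y;
        } else if (tmp.tag != 'B') {
            is.setstate(std::ios_base::failbit);  // unknown tag
        }
    }
    if (is) {
        std::swap(tmp, v);
    }
    return is;
}

// Hypothetical helper: writes a value and reads it back into out.
bool round_trip(const FooBar& in_value, FooBar& out) {
    std::stringstream ss;
    ss << in_value;
    return static_cast<bool>(ss >> out);
}
```

Because the tag comes first, the reader never has to guess or backtrack.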
