6

I have the output of another program that was intended to be human-readable rather than machine-readable, but I'm going to parse it anyway. It's nothing too complex.

Yet, I'm wondering what the best way to do this in C++ is. This is more of a 'general practice' type of question.

I looked into Boost.Spirit, and even got it working a bit. That thing is crazy! If I were designing the language that I was reading, it might be the right tool for the job. But as it is, given its extreme compile times and the several pages of errors from g++ when I do anything wrong, it's just not what I need. (I don't have much need for run-time performance either.)

I thought about using C++'s operator >>, but that seems worthless. If my file has lines like "John has 5 widgets" and others like "Mary works at 459 Ramsy street", how can I even make sure I have a line of the first type and not the second? I guess I'd have to read the whole line and then use things like string::find and string::substr.

And that leaves sscanf. It would handle the above cases beautifully:

if( sscanf( str, "%s has %d widgets", chararr, & intvar ) == 2 )
      // then I know I matched "foo has bar" type of string, 
      // and I now have the parameters too

So I'm just wondering if I'm missing something or if C++ really doesn't have much built-in alternative.

Scott
  • What do you mean by "built-in alternative"? Usually that means the standard library (only), but you're already using Boost. What are you asking? – Fred Nurk Feb 14 '11 at 03:53

7 Answers

3

sscanf does indeed sound like a pretty good fit for your requirements:

  • you may do some redundant parsing, but you don't have performance requirements prohibiting that
  • it localises the requirements on the different input words and allows parsing of non-string values directly into typed variables, making the different input formats easy to understand
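As a rough sketch of that idea, the function below matches one line type with sscanf and rejects everything else. The names `WidgetLine` and `parse_widget_line` are made up for illustration, and the 32-byte name buffer is an assumption. The `%n` directive records how many characters were consumed without counting as a conversion, so it can be used to check that the trailing literal matched and that nothing follows it.

```cpp
#include <cstdio>
#include <string>

// Hypothetical record type for "<name> has <n> widgets" lines.
struct WidgetLine
{
    std::string name;
    int count;
};

// Returns true and fills `out` only if the whole line matches the pattern.
bool parse_widget_line(const char* line, WidgetLine& out)
{
    char name[32];      // assumed maximum name length of 31 characters
    int count = 0;
    int consumed = 0;   // stays 0 unless the literal " widgets" also matched

    // %31s leaves room for the terminating NUL in name[32].
    // %n stores the number of characters read so far and is not counted
    // in sscanf's return value.
    if (std::sscanf(line, "%31s has %d widgets%n", name, &count, &consumed) == 2
        && consumed > 0 && line[consumed] == '\0')
    {
        out.name = name;
        out.count = count;
        return true;
    }
    return false;
}
```

Without the `%n` check, a line such as "John has 5 gadgets" would still return 2, because both conversions succeed before the trailing literal fails to match.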

A potential problem is that it's error prone, and if you have lots of oft-changing parsing phrases then the testing effort and risk can be worrying. Here is an approach that keeps the spirit of sscanf but uses istream for type safety:

#include <cstring>   // strcmp
#include <iostream>
#include <sstream>
#include <string>

// Str captures a string literal and consumes the same from an istream...
// (for non-literals, better to have `std::string` member to guarantee lifetime)
class Str
{
  public:
    Str(const char* p) : p_(p) { }
    const char* c_str() const { return p_; }
  private:
    const char* p_;
};

bool operator!=(const Str& lhs, const Str& rhs)
{
    return strcmp(lhs.c_str(), rhs.c_str()) != 0;
}

std::istream& operator>>(std::istream& is, const Str& str)
{
    std::string s;
    if (is >> s)
        if (s.c_str() != str)
            is.setstate(std::ios_base::failbit);
    return is;
}

// sample usage...

int main()
{
    std::stringstream is("Mary has 4 cats");
    int num_dogs, num_cats;

    if (is >> Str("Mary") >> Str("has") >> num_dogs >> Str("dogs"))
    {
        std::cout << num_dogs << " dogs\n";
    }
    else if (is.clear(), is.seekg(0), // "reset" the stream...
             (is >> Str("Mary") >> Str("has") >> num_cats >> Str("cats")))
    {
        std::cout << num_cats << " cats\n";
    }
}
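For words that are not string literals, the comment on `Str` suggests holding a `std::string` member so the text cannot outlive its owner. A minimal sketch of that variant follows; `OwnedStr` and `match_count` are hypothetical names introduced here, not part of the code above.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical variant of Str that owns a copy of the expected word.
class OwnedStr
{
  public:
    explicit OwnedStr(const std::string& s) : s_(s) { }
    const std::string& str() const { return s_; }
  private:
    std::string s_;
};

// Reads one whitespace-delimited word and fails the stream if it differs.
std::istream& operator>>(std::istream& is, const OwnedStr& expected)
{
    std::string word;
    if (is >> word && word != expected.str())
        is.setstate(std::ios_base::failbit);
    return is;
}

// Hypothetical helper: matches "<owner> has <n> <item>" where the words
// are only known at run time. `n` is left untouched on failure.
bool match_count(const std::string& line, const std::string& owner,
                 const std::string& item, int& n)
{
    std::istringstream is(line);
    int value = 0;
    if (is >> OwnedStr(owner) >> OwnedStr("has") >> value >> OwnedStr(item))
    {
        n = value;
        return true;
    }
    return false;
}
```

Building a fresh `std::istringstream` per attempt avoids the clear-and-seek reset, at the cost of copying the line each time.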
Tony Delroy
  • I really like the approach of stream-based validation here. It looks very natural, and it is still powerful enough to deal with more complex validations, e.g. `MatchIPv4Addr("10.0.0.0/8", &addr)` – Tom Feb 17 '11 at 03:09
  • @Tom: Interesting point and illustration. It's nice when you can have an `IPv4Addr addr` object directly support streaming à la `operator>>`, but it's generally best to have only one such function, so it's quite inflexible if there are app-specific notations in use. A separate "matcher" like you've suggested is very flexible. – Tony Delroy Feb 17 '11 at 03:53
  • `return strcmp(lhs.c_str(), rhs.c_str()) != 0;` should be written as `return lhs != rhs;` – user102008 Mar 18 '11 at 05:48
  • @user102008: I can't see it: we want to compare the textual content, and not the pointers. Consider that a string into which "xyz" has been streamed will store that at a different address to any string literal "xyz" known during compilation. – Tony Delroy Mar 18 '11 at 12:27
2

The GNU tools flex and bison are very powerful tools you could use that are along the lines of Spirit but (according to some people) easier to use, partially because the error reporting is a bit better since the tools have their own compilers. This, or Spirit, or some other parser generator, is the "correct" way to go with this because it affords you the greatest flexibility in your approach.

If you're thinking about using strtok, you might want to instead take a look at stringstream, which splits on whitespace and lets you do some nice formatting conversions between strings, primitives, etc. It can also be plugged into the STL algorithms, and avoids all the messy details of raw C-style string memory management.
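A small sketch of the stringstream approach, assuming whitespace-separated fields; the helper names `split_words` and `to_int` are invented for this example.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits a line into whitespace-separated words.
std::vector<std::string> split_words(const std::string& line)
{
    std::istringstream in(line);
    std::vector<std::string> words;
    std::string word;
    while (in >> word)
        words.push_back(word);
    return words;
}

// Converts one word to int; fails if the word is not entirely an integer.
// `out` is left untouched on failure.
bool to_int(const std::string& word, int& out)
{
    std::istringstream in(word);
    int value = 0;
    char leftover;
    if (!(in >> value) || (in >> leftover))
        return false;
    out = value;
    return true;
}
```

For "Mary works at 459 Ramsy street", `split_words` yields six words, and `to_int` succeeds only on the fourth.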

templatetypedef
  • Now, have a look at http://stackoverflow.com/questions/750112/overengineering-how-to-avoid-it/750130#750130 – Charlie Martin Feb 17 '11 at 04:37
  • @Charlie Martin- Can you elaborate on how this is overengineering? I can understand your point, but I strongly disagree with it. This seems more like "using the right tool for the right job" or "avoid reinventing the wheel," whereas overengineering would be something like building your own parser generator or writing your own general parsing framework to handle all possible changes to the framework. – templatetypedef Feb 17 '11 at 04:55
  • Yes, I can. Using Bison and Flex requires constructing both a description of a lexer and a grammar. He's talking about an input stream that's free text but where he's apparently primarily interested in taking whitespace-separated fields and converting them to internal representations, ergo the sscanf suggestion. Note that overengineering is not "not invented here" -- whether it were home-brewed or GNU software, using a parser generator for this is swatting a fly with a 9 pound sledge. – Charlie Martin Feb 17 '11 at 15:59
1

I've written extensive parsing code in C++. It works just great for that, but I wrote the code myself and didn't rely on more general code written by someone else. C++ doesn't come with extensive code already written, but it's a great language to write such code in.

I'm not sure what your question is beyond just that you'd like to find code someone has already written that will do what you need. Part of the problem is that you haven't really described what you need, or asked a question for that matter.

If you can make the question more specific, I'd be happy to try and offer a more specific answer.

Jonathan Wood
1

I've used Boost.Regex (which I think is also tr1::regex). It's easy to use.
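For illustration, here is roughly what matching one of the question's line types looks like. This sketch uses `std::regex` from C++11 as a stand-in, on the assumption that Boost.Regex and tr1::regex offer a very similar interface; `match_widgets` is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// Matches "<name> has <n> widgets" and extracts the two fields.
// Note: std::stoi throws std::out_of_range if the digits do not fit in an int.
bool match_widgets(const std::string& line, std::string& name, int& count)
{
    // Default ECMAScript syntax: \S+ is a run of non-space characters,
    // \d+ a run of digits. regex_match requires the whole line to match.
    static const std::regex pattern("(\\S+) has (\\d+) widgets");
    std::smatch m;
    if (!std::regex_match(line, m, pattern))
        return false;
    name = m[1].str();
    count = std::stoi(m[2].str());
    return true;
}
```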

Guy Sirton
0

There is always strtok(), I suppose.
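A minimal sketch of how that might look, assuming fields are separated by spaces or tabs; `tokenize` is a name chosen for this example. strtok writes NUL characters into the buffer it scans and keeps internal state between calls, so the sketch works on a private copy of the line, and the function is not safe to call from several threads at once.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Tokenises a copy of the line on spaces and tabs using strtok.
std::vector<std::string> tokenize(const std::string& line)
{
    // strtok modifies its input, so work on a mutable, NUL-terminated copy.
    std::vector<char> buffer(line.begin(), line.end());
    buffer.push_back('\0');

    std::vector<std::string> tokens;
    for (char* tok = std::strtok(&buffer[0], " \t"); tok != NULL;
         tok = std::strtok(NULL, " \t"))
    {
        tokens.push_back(tok);
    }
    return tokens;
}
```

The caller still has to decide which line type it is looking at and convert the numeric tokens itself.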

Martin Beckett
0

Have a look at strtok.

Charlie Martin
0

Depending on exactly what you want to parse, you may well want a regular expression library. See msdn or earlier question.

Personally, again depending on the exact format, I'd consider using Perl to do an initial conversion into a more machine-readable format (e.g. variable-record CSV) and then import that into C++ much more easily.

If sticking to C++, you need to:

  1. Identify a record - hopefully just a line
  2. Determine the type of the record - use regex
  3. Parse the record - scanf is fine

A base class along the lines of:

class Handler
{
public:
    Handler(const std::string& regexExpr)
        : regex_(regexExpr)
    {}
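    // Virtual destructor so derived handlers can be deleted through a Handler*.
    virtual ~Handler() {}
    bool match(const std::string& s) const
    {
        return std::tr1::regex_match(s,regex_);
    }
    virtual bool process(const std::string& s) = 0;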
private:
    std::tr1::basic_regex<char> regex_;
};

Define a derived class for each record type, stick an instance of each in a set and search for matches.

class WidgetOwner : public Handler
{
public:
    WidgetOwner()
        : Handler(".* has .* widgets")
    {}
    virtual bool process(const std::string& s) 
    {
        char name[32];
        int widgets= 0;
        // %31s, not %32s: the width excludes the terminating NUL, so 32 would overflow name[32]
        int fieldsRead = sscanf( s.c_str(),  "%31s has %d widgets", name, & widgets) ;

        if (fieldsRead == 2)
        {
            std::cout << "Found widgets in " << s << std::endl;
        }
        return fieldsRead == 2;
    }
};

struct Pred 
{
    Pred(const std::string& record)
        : record_(record)
    {}
    bool operator()(Handler* handler)
    {
        return handler->match(record_);
    }
    std::string record_;
};

std::set<Handler*> handlers_;
handlers_.insert(new WidgetOwner);
handlers_.insert(new WorkLocation);

Pred pred(line);
std::set<Handler*>::iterator handlerIt = 
     std::find_if(handlers_.begin(), handlers_.end(), pred);
if (handlerIt != handlers_.end())
    (*handlerIt)->process(line);
Keith