Best parser generator for parsing many small texts in C++?

Question

I am, for performance reason, porting a C# library to C++. During normal operation, this library needs, amongst other things, to parse about 150'000 math expressions (think excel formulas) with an average length of less than 150 characters.

In the C# version, I used GOLD parser to generate parsing code. It can parse all 150'000 expressions in under one second.

Because we were thinking about extending our language, I figured the move to C++ might be a good chance to change to ANTLR. I have ported the (simple) grammar to ANTLR and generated C code out of it. Parsing the 150'000 expressions takes over 12 seconds, because for each expression, I need to create a new ANTL3_INPUT_STREAM, token stream, lexer and parser - there is, at least in version 3.4, no way to reuse them.

I'd be grateful is someone could give me a recommendation what to use instead - GOLD is of course an option though generating C++ or C code seems a lot more complicated than the C# variety. My grammar is LALR and LL(1) compatible. Paramount concern is parsing performance on small inputs.

Down to 400 ms (from 12s) with boost spirit - thanks a bunch! — user816098, Dec 02 '11 at 16:11

score 9 · Accepted Answer · answered Dec 02 '11 at 12:25

9

I would try boost::spirit. It is often extreamly fast (even for parsing simple things like an integer it can be faster than the C function atoi http://alexott.blogspot.com/2010/01/boostspirit2-vs-atoi.html)

http://boost-spirit.com/home/

It has nice things : header only, so dependency hell, liberal licence.

However be warned that the learning curve is difficult. It's modern C++ (no pointer, but a lot of template and very frustrating compiling errors), so coming from C or C#, you might not be very comfortable.

answered Dec 02 '11 at 12:25

Tristram Gräbener

9,601
3
34
50

1

Thank you and Matthieu M. for suggesting boost spirit - I will give it a try despite the complexity warnings (nothing like a challenge, and I am used to some hefty C++). I'd like to avoid writing my own parser as changes or extension in our language would be easier to integrate with something like ANTLR or boost spirit. – user816098 Dec 02 '11 at 12:39

score 4 · Answer 2 · answered Dec 02 '11 at 12:30

4

If the grammar to be parsed is simple, you might just write the parser by hand.

Most parser generators are designed to make it easy to whip up a working parser, and execution time often suffers as a result.

answered Dec 02 '11 at 12:30

jalf

243,077
51
345
550

score 3 · Answer 3 · answered Dec 02 '11 at 12:24

The best performance I have seen in parsing came from Boost.Spirit.Qi which expresses the grammar in C++ using meta-template programming. It is not for the faint of heart though.

This will need be well insulated, and the compilation time of the file containing the parser will increase to several seconds (so better make sure there is as few as possible there).

score 2 · Answer 4 · answered Dec 02 '11 at 12:31

If the syntax of your expression is simple enough, consider making a hand-written recursive descent parser. It can run really fast, and give you the ability (with enough care) to report nicely and specifically syntax errors.

You could also use bison, but I believe a hand-written recursive parser would probably go faster.

And you can do lexing with a flex generated lexer, and do the parsing manually, in a recursive descent way.

For your information the GCC compiler has its own recursive-descent parsers for C++ & C at least. It does not use parser generators (like bison or ANTLR) anymore.

chill · Answer 5 · 2011-12-02T14:44:22.043

1

Instead of expr make you grammar recognize sequence-of-expr.

EDIT:

Instead of having (bison syntax):

start: expr { process_expr ($1); }
     ;

have:

start: expr_seq ;

expr_seq:   expr          { process_expr ($1); }
          | expr_seq expr { process_expr ($2); }
          ;

edited Dec 02 '11 at 14:44

answered Dec 02 '11 at 12:22

chill

16,470
2
40
44

1

Sadly, this is not an option, as the expressions I need to parse depend mostly on previous parsing results (they come out of a much larger pool and represent a dependency chain). – user816098 Dec 02 '11 at 12:25
1

@user816098, dependencies between expressions should encourage the `sequence-of-expr` way. If you have a time post sample input and desired output, please. – Kirill Dec 02 '11 at 12:47
1

Kirill, imagine an excel spreadsheet with 25 billion cells. One cell may depend on 5 other cells, which in turn depend on others etc. - typically, this chain is 150'000 cells long. As the set of cell references only emerges during parsing, there is no way to use sequence-of-expr without feeding all 25 billion expressions, which wouldn't be effective. – user816098 Dec 02 '11 at 12:57
3

@user816098, the dependencies are orthogonal to the problem of creating 150'000 parser instances. Handle the dependencies in your semantic actions. – chill Dec 02 '11 at 14:37
Sounds like #include functionality. You can do this with PUSHSTREAM() which tells ANTLR to start parsing a different input using the same lexer. – Dec 02 '11 at 20:16

Mike Dunlavey · Answer 6 · 2011-12-02T19:30:45.040

1

I've written many parsers, and hand-coded recursive-descent is the way I do it. They are easy to write and pretty much optimal.

That said, if speed is what you're after, no matter what you write there will be plenty of room to speed it up. These will be in ways that could surprise you, because anything you could think of, you would have already done.

Here's a slide set showing how to do it.

edited Dec 02 '11 at 19:30

answered Dec 02 '11 at 13:31

Mike Dunlavey

40,059
14
91
135

Best parser generator for parsing many small texts in C++?

6 Answers6

Linked