9

After asking this question, I'm now sold on trying to use a parser generator, where before I was going to write things manually.

However, I can't seem to find any such parser generator that emits C++ code, nor can I find one that correctly handles Unicode. (Note that my input is UCS-2 -- I don't care about supporting code points outside of the Basic Multilingual Plane if that makes building the parser more difficult.)

There are some parser generators which can generate C, but they all seem to throw exception safety out the window, which would prevent me from using C++ inside any semantic actions.

Does a parser generator exist which meets these two requirements, or am I stuck doing everything by hand?

EDIT: Oh, and my project is licensed under the Boost Software License (BSL), so there can't be many restrictions on use of the output of the parser generator itself.

Billy ONeal
  • More details on the requirement please? Parsers don't care about character sets: it is the lexer that handles this. If you only require comments and string literals to be Unicode you can do that easily with most existing 8-bit clean lexers by transcoding to UTF-8. Slightly harder to handle Unicode identifiers. – Yttrill Dec 04 '10 at 20:19
  • @Yttrill: Input is UTF-16 (Well, really UCS-2, because I'm not supporting things outside the BMP) stored in a `std::wstring`. I don't see what identifiers have to do with anything because neither the lexer nor parser really need to care about those -- that'd be my job to handle in semantic actions. – Billy ONeal Dec 04 '10 at 20:31
  • The problem is lexing identifiers. The parser will just see IDENTIFIER with a string attribute or something, it doesn't care much what's in the string: the string can be std::wstring or even std::string with UTF-8 encoding. – Yttrill Dec 04 '10 at 21:58
  • @Billy ONeal: Yes, parser and lexer need to be compatible but that's usually not too hard to organise or hack around. If you have a token IDENTIFIER which has std::string attribute, you can easily transcode your UCS-2 std::wstring into a std::string using UTF-8 encoding, for example (a transcoding sketch follows this comment thread). The real problem is actually lexing the identifier, i.e. finding the start and end points in the input. – Yttrill Dec 04 '10 at 22:16
  • @Billy - the exception safety issue is mostly quite easy to work around. If parser actions build an AST-like data structure, they need do little more than allocate nodes - exceptions should only happen if you have non-trivial constructors (which you shouldn't) or run out of memory. Tree-walking code is separate, and as exception-safe as you make it. Handling out-of-memory well may be a bit awkward, though. –  Dec 10 '10 at 23:48
  • @Steve: If I'm going to spend all that time working around the parser generator, then I might as well do it by hand. – Billy ONeal Dec 11 '10 at 02:57
  • @Billy - An AST is normal for any significant parsing job anyway, since the order that the parsing algorithm discovers instances of grammar rules is rarely the best order to evaluate them. It doesn't have to be raw nodes and pointers - you can use containers like `std::map`. The normal approach to out-of-memory is to ignore the possibility, but a single exception "that file is too big to parse" should be easy enough. –  Dec 12 '10 at 11:29
  • It's an old question I noticed only now. Want to suggest trying AXE parser generator, it should work with unicode, or even with binary or mixed streams, it doesn't really care. If you find errors, let me know. – Gene Bushuyev Jun 02 '11 at 21:44
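
As a couple of the comments above suggest, transcoding the UCS-2 `std::wstring` input to UTF-8 is mechanical. A minimal sketch, assuming BMP-only input with no surrogate-pair handling (matching the question's assumption); this is illustrative, not taken from any particular library:

```cpp
#include <cstddef>
#include <string>

// Minimal UCS-2 (BMP only) to UTF-8 transcoder; no surrogate-pair handling.
std::string toUtf8(const std::wstring& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned cp = static_cast<unsigned>(in[i]) & 0xFFFF;
        if (cp < 0x80) {                                  // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {                          // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                                          // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```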

8 Answers

5

There are two ways to do this in C++: using a program that generates C++ files from a grammar written in free form, or using templates.

And you have two choices when writing a grammar in template types: using boost::proto, where every operator is redefined to build a syntax tree in boost::fusion (used in boost::spirit, boost::msm and boost::xpressive; the basic idea is Expression Templates), or building an expression tree by hand with your own templates and storing it directly in boost::mpl containers. The latter technique is used in biscuit.

In biscuit you have `or_<>`, `seq_<>`, `char_<>`, ... templates. Biscuit is based on YARD, but adds an extended boost::range to get better submatch capability.
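
To make the second approach concrete, here is a minimal sketch of a grammar expressed purely as template types over wide-character input. The names `or_`, `seq_` and `char_` only mimic biscuit's style; this is not the biscuit library itself:

```cpp
// The grammar is encoded entirely in the types; matching is a static call.

template <wchar_t C>
struct char_ {                     // match a single code unit
    static bool match(const wchar_t*& p, const wchar_t* end) {
        if (p != end && *p == C) { ++p; return true; }
        return false;
    }
};

template <class A, class B>
struct seq_ {                      // A followed by B, with backtracking
    static bool match(const wchar_t*& p, const wchar_t* end) {
        const wchar_t* save = p;
        if (A::match(p, end) && B::match(p, end)) return true;
        p = save;
        return false;
    }
};

template <class A, class B>
struct or_ {                       // A, or else B
    static bool match(const wchar_t*& p, const wchar_t* end) {
        return A::match(p, end) || B::match(p, end);
    }
};

// Grammar "ab" | "ac", expressed purely as a type.
typedef or_< seq_<char_<L'a'>, char_<L'b'> >,
             seq_<char_<L'a'>, char_<L'c'> > > ab_or_ac;

// usage: const wchar_t* p = input; bool ok = ab_or_ac::match(p, input + length);
```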

The Biscuit Parser Library 1

The Biscuit Parser Library 2

Yet Another Recursive Descent (YARD) parsing framework for C++

  • In general, please provide a small summary or something when posting links, so a reader wanting to get an overview of the answers doesn't have to follow too many links, and so those who *do* want to learn the details have an idea about what they're clicking on. (And also so people have a reason to upvote your answer. Hard to justify an upvote when you haven't added anything yourself. :)) – jalf Dec 11 '10 at 06:10
4

Alright, this might be a long shot, but there is an LALR parser generator maintained as a side project to Qt, called QLALR. It is a really thin layer: the lexing is still up to you, but all the work can be done through QStrings, which support Unicode. There is not a lot of functionality to it; you write the grammar with the code that does the work for each token, and it will generate the parser for you. But I have used it successfully to generate a parser with ~100 rules, creating an AST of the parsed language.
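
Since QLALR leaves lexing to you, a hand-written tokenizer over a `QString` is the usual companion piece. A minimal sketch of what that step might look like; the `Token` struct and the `T_IDENTIFIER`/`T_END` constants are made-up names for illustration, not part of QLALR:

```cpp
#include <QChar>
#include <QString>

// Hypothetical token record; QLALR does not dictate how you represent tokens.
struct Token { int kind; QString text; };

// Named kinds sit above any single UTF-16 code unit.
enum { T_END = 0x10000, T_IDENTIFIER = 0x10001 };

static Token makeToken(int kind, const QString& text)
{
    Token t;
    t.kind = kind;
    t.text = text;
    return t;
}

// Scans the next token starting at 'pos', advancing 'pos' past it.
Token nextToken(const QString& src, int& pos)
{
    while (pos < src.size() && src.at(pos).isSpace())
        ++pos;
    if (pos >= src.size())
        return makeToken(T_END, QString());

    const int start = pos;
    if (src.at(pos).isLetter() || src.at(pos) == QLatin1Char('_')) {
        while (pos < src.size()
               && (src.at(pos).isLetterOrNumber() || src.at(pos) == QLatin1Char('_')))
            ++pos;
        return makeToken(T_IDENTIFIER, src.mid(start, pos - start));
    }

    // Anything else: a single-character token whose kind is its code unit.
    ++pos;
    return makeToken(src.at(start).unicode(), src.mid(start, 1));
}
```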

Harald Scheirich
1

ANTLR has Unicode support. It has C++ (and C, Java and a few other languages) support, though I've never used the C++ support so I'm not sure how well developed it is.

Laurence Gonsalves
  • ANTLR does not generate C++ code. (C only) (See http://www.antlr.org/wiki/display/ANTLR3/Code+Generation+Targets ) – Billy ONeal Nov 30 '10 at 19:45
  • It says "C target as of release 3.1 is C++ compatible". Right at the bottom of the page George Shannon says: "I am currently working on a C++ runtime target." That was from February, so maybe contact him and find out the progress. – Laurence Gonsalves Nov 30 '10 at 19:52
  • I've spent some time playing with it. The problem is that if an exception is thrown the entire works comes crashing down. – Billy ONeal Nov 30 '10 at 19:57
  • The fact that the generated C code is C++ compatible does not mean ANTLR generates C++ code (which it doesn't). Also, besides George Shannon's wiki post, I've never seen any details about his supposed C++ target development on the ANTLR mailing lists. But you're right, Billy could try and e-mail the guy, of course. – Bart Kiers Dec 04 '10 at 09:18
1

There appears to be preliminary support for Unicode in boost::spirit.

SingleNegationElimination
  • Interesting. Is there some way to make it handle errors more gracefully? Seems it just fails if there's input that doesn't match the grammar... +1 – Billy ONeal Dec 04 '10 at 01:48
1

If you're in the mood to experiment, this one supports wide chars but is obscure: http://wiki.winprog.org/wiki/LibCC_Parse

tenfour
1

The parser doesn't care about characters since it processes tokens.

Lexing Unicode is very expensive. This is because you either pay a huge function-call overhead for classification, or you kill your memory with massive tables. Normally you'd only support Unicode in specific places in a programming language, such as string literals and perhaps identifiers, where a handcrafted function can do the job efficiently.

I once coded, in Ocamllex, a lexer that would accept the identifiers mandated by the ISO C++ standard (which includes a set of ranges of Unicode code points considered "letters" in various languages). Although the number of code-point ranges is quite small (around 20 ranges), the UTF-8 DFA for this has over 64K states and blew up the lexer generator :)

My advice here is: you will have to hand craft your lexer. It is, in fact, very easy to do this inefficiently. Doing it efficiently is very much harder: I'd be looking at Judy arrays for support (this is the fastest data structure on the planet).
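
To illustrate the handcrafted route, here is a minimal sketch of one common idea (my own illustration, not taken from the answer): classify identifier characters with a small sorted table of code-point ranges and a binary search, avoiding both a per-character library call and a full Unicode table. The ranges shown are a tiny made-up subset, not the list from the C++ standard:

```cpp
#include <algorithm>

struct Range { unsigned short lo, hi; };

// Tiny illustrative subset of "identifier letter" ranges; keep sorted by lo.
static const Range kLetterRanges[] = {
    { 0x0041, 0x005A },   // A-Z
    { 0x005F, 0x005F },   // _
    { 0x0061, 0x007A },   // a-z
    { 0x00C0, 0x00D6 },   // part of Latin-1 letters
    { 0x0370, 0x03FF },   // part of Greek
    { 0x4E00, 0x9FFF },   // part of CJK Unified Ideographs
};

// Comparator for upper_bound: true if the value sorts before the range.
static bool beforeRange(unsigned short c, const Range& r) { return c < r.lo; }

bool isIdentifierChar(unsigned short c)
{
    const Range* begin = kLetterRanges;
    const Range* end   = kLetterRanges
                       + sizeof(kLetterRanges) / sizeof(kLetterRanges[0]);
    // Find the first range whose lower bound exceeds c, then step back one.
    const Range* it = std::upper_bound(begin, end, c, beforeRange);
    return it != begin && c <= (it - 1)->hi;
}
```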

Yttrill
  • Well, sure the parser doesn't care. But the parser and the lexer have to have compatible interfaces. UTF-8 lexing would be very expensive, but using UCS-2 means that the code points are fixed-width, which should make parsing easier. – Billy ONeal Dec 04 '10 at 21:01
  • Is UTF8 lexing really a problem? Lexing is usually a pretty fast process, compared to what happens in later passes. Seems like this concern might be premature optimization. – jalf Dec 11 '10 at 06:13
  • The solution is not to make the UTF-8 encoding part of the lexer but having it as a lower level. The lexer would call a `getcodepoint()` where for 8-bit characters it would normally have `getchar()`. The lexer would be a state machine for those codepoints only, not for the format of UTF-8 too. `getcodepoint()` would read the stream of octets, know the encoding of UTF-8, and return 32-bit values (or 16-bit values if you only care about the BMP). This would also result in an extra place to catch errors/exceptions: as well as reading a bad token, it would be possible to read a malformed codepoint. (A sketch of this layering follows this comment thread.) – hippietrail May 09 '12 at 18:36
  • @jalf: It might be counter-intuitive, but it's pretty well known that lexing is actually the slowest stage of a compiler! But I don't believe lexing UTF-8 or UTF-16 need be slower than lexing 7- or 8-bit character sets. – hippietrail May 09 '12 at 18:38
  • @hippietrail: Well known by who? That certainly depends on the compiler, and the language. – jalf May 09 '12 at 19:53
  • @jalf: ["The lexical analyzer is usually the slowest part of a compiler since it has to read every single character of the input file."](http://web.eecs.utk.edu/~bvz/cs461/notes/flex/), ["I gather that the lexer is often the slowest part of modern compilers, because it's the only phase that has to look at each character of the input individually."](http://compilers.iecc.com/comparch/article/02-04-141), ["Lexer may be slowest part of a compiler, speedup techniques therefore useful."](https://csimoodle.ucd.ie/moodle/mod/resource/view.php?id=10924), etc. – hippietrail May 09 '12 at 21:51
  • @jalf: In fact our own Joel makes it pretty clear: ["As every compiler writer knows, lexing and parsing are the slowest part of compiling. Suffice it to say that it involves a lot of string stuff, which we discovered is slow, and a lot of memory allocation stuff, which we discovered is slow, as we lex, parse, and build an abstract syntax tree in memory."](http://www.joelonsoftware.com/articles/fog0000000319.html) – hippietrail May 09 '12 at 21:53
  • @hippietrail: very good. Now try actually measuring it for a C++ compiler (which is what this question is about). I think you'll find that lexing is the easy part. (Or, to give you a simple example, compare compilation speed for C and C++ code. The lexing phase is virtually identical. And yet C++ code typically compiles *a lot* slower.) – jalf May 09 '12 at 21:58
  • Yes C and C++ are also slow due to the header file concept and the linker stage but C++ has complex features like templates which I imagine must be one of the slowest parts. It would be nice to hear some actual stats. – hippietrail May 09 '12 at 22:06
  • The Clang compiler's website has a few benchmarks which, among other things, give an idea of how much time is spent in different compilation stages, if you're interested. – jalf May 09 '12 at 22:30
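
To make the `getcodepoint()` layering described in the comment thread above concrete, here is a minimal sketch of a decoder that turns a UTF-8 byte stream into code points below the lexer. It is illustrative only: it does not reject overlong encodings, and the caller must ensure the buffer is NUL-terminated or otherwise bounded.

```cpp
// Returns the next code point from a UTF-8 buffer, advancing p past it.
// Malformed input yields U+FFFD (REPLACEMENT CHARACTER).
unsigned getcodepoint(const unsigned char*& p)
{
    unsigned c = *p++;
    int extra;
    if (c < 0x80)                 return c;                 // plain ASCII
    else if ((c & 0xE0) == 0xC0) { c &= 0x1F; extra = 1; }  // 2-byte sequence
    else if ((c & 0xF0) == 0xE0) { c &= 0x0F; extra = 2; }  // 3-byte sequence
    else if ((c & 0xF8) == 0xF0) { c &= 0x07; extra = 3; }  // 4-byte sequence
    else                          return 0xFFFD;            // stray continuation byte
    while (extra--) {
        if ((*p & 0xC0) != 0x80)  return 0xFFFD;            // malformed sequence
        c = (c << 6) | (*p++ & 0x3F);
    }
    return c;
}
```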
0

Try Boost.Spirit. You can plug in your own "stream decoder", which handles the Unicode part of your problem. Making Spirit work with wchar_t should be possible, although I have not tried it myself.
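
For what it's worth, a minimal, untested sketch of the wchar_t route using Spirit's Qi with the `standard_wide` character parsers (this assumes Spirit 2.x and only shows the idea, not a full grammar):

```cpp
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>

namespace qi   = boost::spirit::qi;
namespace wide = boost::spirit::standard_wide;

int main()
{
    std::wstring input = L"someIdentifier42";
    std::wstring::const_iterator first = input.begin(), last = input.end();

    // identifier: a letter or underscore, then letters, digits or underscores
    bool ok = qi::parse(first, last,
                        (wide::alpha | wide::char_(L'_'))
                        >> *(wide::alnum | wide::char_(L'_')));

    std::cout << (ok && first == last ? "matched" : "no match") << std::endl;
    return 0;
}
```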

towi
  • It does work, but I have the damndest time getting anything complicated to compile with it. (As I said to [the other boost spirit answer](http://stackoverflow.com/questions/4317799/are-there-any-free-parser-generators-which-generate-c-code-and-handle-unicode-c/4318094#4318094) ) – Billy ONeal Dec 09 '10 at 16:17
  • Yes, you are probably right. Although, if one manages to learn it, it teaches a lot about C++-ish paradigms :-) And yes, error handling is often the *real* tricky part, no matter what framework. – towi Dec 10 '10 at 13:54
0

I don't know a whole lot of theory about parsers, so forgive me if this doesn't fit the bill, but there is Ragel.

Ragel generates state machines. It's (perhaps most famously?) used by the Mongrel HTTP server for Ruby to parse HTTP requests.

Ragel targets plain C (amongst others), but all of the state machine data is either static const or stack allocated, so that should alleviate some important concerns with C++ exceptions. If special exception handling is required, Ragel doesn't shy away from exposing its internals. (Not as complex as it may sound.)

Unicode should be possible, because input is an array of any basic type, usually char, but probably short or int in your case. If that doesn't do, you can even replace the array iteration with your own mechanism for getting the next input item/token/event.

Stéphan Kochen
  • My problem isn't actually the library itself generating exceptions, but exceptions in my semantic actions. – Billy ONeal Dec 09 '10 at 21:50
  • Ragel is more a lexer generator than a parser generator - a lot more powerful than something like lex, but still working with regular grammars. There is a related kelbt tool for backtracking LR parsing, but it isn't supported the way ragel is. I think I remember something about Ragel having some problems handling large character sets due to bit-flags it adds for its own purposes - but that could easily be completely wrong. I use it all the time, but that has only meant minor tweaks to existing grammars for quite a while, and I only work with 8 bit characters. –  Dec 10 '10 at 23:36