
OK, I understand this question may sound quite opinion-based; however, since I have several specific selection criteria, I think it makes a decent fit for SO. So, here I am...

I've worked with compiler/interpreter construction quite a lot in the past (mostly as a hobby, obviously) and for some reason I've stuck with Lex/Yacc (or Flex/Bison; I'm quite confused as to what they're called nowadays... lol).

However, since I currently find myself playing with yet another hobbyist interpreter project, I thought I should maybe try something different, in order to avoid the things I don't like about Lex/Yacc.

So, namely:

  • More C++-friendly (than C-friendly)
  • Good documentation (preferably with some existing grammars already implemented, plus instructions on how to compile/use them - sounds rather obvious, huh?)
  • Could be LALR, LL(*), or recursive descent; I don't really care (note: input as to which type you would prefer, and for what kind of implementation, would be great though; I've never really understood their pros and cons, to be perfectly honest, even though I do know what they refer to)
  • Combining the lexer part and the parser grammar in one file wouldn't be bad at all; I never really got why it has to be split in two.
  • Last but not least: I've always had issues with... issues. I mean - at least as far as Lex/Yacc go, parsing error messages are more-or-less cryptic (Syntax Error... Yuhuu!) and rarely do they help diagnose the issue. (Well, unless you are the one who developed the interpreter... lol) So, is there anything better than Lex/Yacc regarding error reporting?

OK, I hope that wasn't too verbose. I'm all ears! :-)

Dr.Kameleon
  • You should ask this on softwarerecommendations, a new site on stackexchange.com. – Ira Baxter Mar 01 '14 at 18:27
  • For a first look, check out http://en.wikipedia.org/wiki/Compiler-compiler – Ira Baxter Mar 01 '14 at 18:32
  • For reference: parsing expression grammars are another nice way of having the lexer and parser specified concurrently. That's handy if you can't use something like ANTLR. PEGs enable context-sensitive lexer rules. At least one notable C++ implementation of PEG is Dr. Hirsch's [PEGTL](https://github.com/colinh/pegtl), an MIT-licensed, dependency-free, header-only PEG lexer/parser implementation in C++11. The lexer and the parser are described using the same language, and it's all written using plain C++ expressions; no need for a separate tool run. – Kuba hasn't forgotten Monica Jan 23 '16 at 19:18

3 Answers


I've been building parser generators and parsers since 1969.

Recursive descent, YACC and JavaCC are the typical answers you hear.

These are your grandpa's parser generators, and they suffer from limitations in the grammars they will accept. Invariably (especially on Stack Overflow), some poor soul asks "how do I solve this shift/reduce conflict?" (for LR parser generators like YACC) or "how do I eliminate this left recursion?" (for recursive descent or LL parser generators like JavaCC). Worse, these generators can't handle grammars that really have syntactic ambiguity, as occurs in most complex languages.

GLR (and GLL) parsers allow you to write context-free grammars ... and parse them, with no fuss or muss. This is a real productivity enhancement. There's a price: you can end up with ambiguous parses, but there are ways to handle that (see this discussion of C++ parsing troubles that neither YACC nor JavaCC can handle by themselves).

Bison (widely available) has a GLR option; use it! Recent multilingual program manipulation tools seem to all use GLL or GLR. Our DMS Software Reengineering Toolkit uses GLR and parses C++ (full C++14 in MS and GNU variants!), Java, COBOL, and a raft of other complicated languages; GLR has been one of the best technical choices I've made in my career. Stratego uses GLR. I think RascalMPL uses GLL. Scott McPeak's Elkhound GLR parser generator is C++-based and generates, I'm pretty sure, C++ code (the OP asked for a C++-based answer).
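To make that concrete, here is a hypothetical sketch of switching a Bison grammar to GLR (the rule and nonterminal names `stmt`, `decl`, `expr` are invented, and `decl`/`expr` are assumed to be defined elsewhere):

```yacc
%glr-parser
%expect-rr 1      /* tolerate one known reduce/reduce conflict */

%%

/* Hypothetical ambiguity: a statement that could parse as either a
   declaration or an expression.  With GLR, both paths are explored,
   and %dprec prefers the declaration reading when both succeed. */
stmt
    : decl        %dprec 2
    | expr ';'    %dprec 1
    ;
```

With a deterministic LALR engine, you would have to contort the grammar until the conflict disappeared; with GLR, you declare how many conflicts you expect and how to rank the alternatives.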

Hot topics these days are PEG and ANTLR4. These are better than LL or LR parsers but still give one grief in trying to shape the grammar. (With PEG, you have to order the productions, assuming you can find such an order, to handle ambiguous rules with priorities. With ANTLR4, you still have to specify lookaheads to resolve ambiguity; I don't know how it handles infinite lookahead.) AFAIK, nobody has built practical C++ parsers with either of these technologies, so they aren't living up to their reputation.

I think GLR and GLL are much, much better answers.

Ira Baxter
  • Well, and imagine I thought my ±20 years of experience sounded quite good! lol. Thanks a lot for your input! I'll definitely have a look into GLR. As for ANTLR4, yep, I've tried it, though (not that it's not easy to use) it seems I feel most comfortable with Bison... :-) – Dr.Kameleon Mar 04 '14 at 06:39
  • Hi, I just read your post; can you give advice about where to start coding a parser/lexer to someone who has never done anything in this field before? What's a case where I can implement something challenging, maybe get a lot of errors, but learn how things work? Maybe you can suggest implementing a real language that is out there with a specific tool/grammar (EBNF or PEG), because based on your experience this hypothetical exercise is a good gym? Or some exercise to really understand the LL vs LR thing, aside from the EBNF vs PEG religious stuff. – user2485710 Aug 22 '14 at 04:38
  • The short answer: pick any of those tasks and go to it. In your shoes, I'd pick my favorite language and go build a lexer for it; you'll find that both harder (detail-wise) and easier (machinery) than you think. Watch out for "white space". If you get past that, I'd pick ANTLR and try to write a parser. These will provide you with good basic experience and a foundation. – Ira Baxter Aug 22 '14 at 12:24
  • @user2485710 The Levine/Mason/Brown [Lex & Yacc](http://www.cs.csubak.edu/~jcourtne/O'Reilly%20-%20Lex%20and%20Yacc.pdf) (PDF link!) book is a decent introduction. It includes practical examples such as a complete SQL parser, and demonstrates a lot of tricks of the trade that are otherwise hard to find in one place. I've found it to be an excellent springboard into contemporary bison. – Kuba hasn't forgotten Monica Jan 23 '16 at 18:08
  • I agree that GLR is the way to go, especially if you're designing a new language and wish the freedom to experiment with grammar. It's a liberating feeling :) – Kuba hasn't forgotten Monica Jan 23 '16 at 18:15
  • Thank you for sharing all of this. This should be the top answer. – Arjun Menon Jun 29 '19 at 16:16

I'm just going to answer the last question, with a slight edit:

at least as far as Lex/Yacc go, parsing error messages are more-or-less cryptic (Syntax Error... Yuhuu!) and rarely do they help diagnose an issue. (Well, unless you are the one who developed the interpreter... lol). So, is there a better way to use Lex/Yacc regarding error reporting?

Well, start by using a modern version of bison, which has a reasonably complete manual online (and quite possibly installed with the executable, depending on how you install bison). In particular, start with these declarations:

%define parse.error verbose
%define parse.lac full

That will at least replace the mysterious "syntax error" error with a list of "expected" token types.

Then make sure that your token types have meaningful names, because they will be presented to the user as part of the error message. If you're used to using IDENTIFIER as a terminal, then you're probably OK, but the message "Expected TOK_YY_ID" is a bit geeky. You can declare a readable name for a terminal in its declaration:

%token TOK_YY_ID "identifier"

That will only take you so far. In many cases, knowing what was "expected" is sufficient to understand a syntax error, but sometimes it's useful to be more explicit. In such cases, it's useful to actually define error rules. Getting these right is more an art than a science, but that's true of all approaches to error reporting/recovery; the key is to try to be as specific as possible about what the erroneous syntax looks like, and not more specific than necessary.
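For instance, here is a hypothetical error production (the rule names `stmt` and `stmt_list` are invented) that resynchronizes on the next semicolon:

```yacc
/* On a malformed statement, discard tokens up to the next ';',
   then let yyerrok re-enable normal error reporting. */
stmt_list
    : stmt
    | stmt_list stmt
    | stmt_list error ';'   { yyerrok; }
    ;
```

The `error` token matches any garbage up to a point where the parser can plausibly resume; keeping the recovery anchored to a concrete token like `';'` is what makes the rule "as specific as possible, but no more".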

One interesting approach to error reporting is to use the current parser state and lookahead token (both of which are visible at the moment the error is reported) to look up a custom error message, if one exists. I think this approach has been part of compiler folklore for a long time, and I'm sure I've seen several articles about it over the decades. Here is a relatively recent article by Russ Cox.

rici

Interesting question - not sure I have a great answer to your actual question, but my "comment" got a bit too long for a comment...

I'm working on a Pascal compiler, and I've pretty much written a lexer, tokenizer and parser (including producing an AST to go into a code generator for LLVM) in around 1100 lines of, if I may say so myself, pretty "nice" C++ code - all by hand. It's much more friendly towards generating good error messages, and it helps with debugging. There are several bits missing, and I still have plenty of work left before my compiler is complete, but I can already compile some fairly complex code.

I admit I've never used Lex/Yacc or Flex/Bison for anything real. I have looked at them from time to time, but I find these tools hard to use: you either end up taking the generated code and modifyinging it (a bad idea with autogenerated code), or you end up with bad error handling, and hard-to-debug code on top of that. But then, I just spent about two hours trying to find an error caused by "eating" a semicolon too early, which in turn caused the parser to get lost in the token flow...

Mats Petersson
  • Well, to say I didn't think (like a million times) about writing the actual lexer/parser myself, from scratch, would be a lie. I'm exactly that... reinvent-the-wheel type of coder (we're talking about compiler construction, so that should be pretty obvious, huh? lol). However, I still believe Parser Generators (apart from Error messages, if I had to single out one major flaw of theirs) can really make the parsing part (which is *not* the core part of the compiler, imho) very easy. It's more-or-less like instantly turning your thoughts from paper to a parseable grammar... :-) – Dr.Kameleon Mar 01 '14 at 01:11
  • *Sidenote:* As you have probably already guessed by my last comment, I'm mostly interested in Language (+compiler) construction (I mean: not creating a compiler for an existing language), so being able to construct my own EBNF grammars, quite "visually", is a big boost in productivity. – Dr.Kameleon Mar 01 '14 at 01:16
  • Sorry, can't really help there. I did have a quick google for something written by one of my colleagues, but didn't find anything particularly great. He's been working on something written in ML that does pretty much what you are asking for, including having built-in documentation for graphically displaying the syntax of the language. – Mats Petersson Mar 01 '14 at 01:20
  • I'm also interested in compilers, but more in a practical way - at work, we have compiler technology (in particular, I work with OpenCL), and I want to understand more about how our compiler works - so writing my own compiler is a way of learning how the core component of that, LLVM, works. – Mats Petersson Mar 01 '14 at 01:22
  • Well, I must say I love your point of view. Unfortunately, what I'm making a living with is far less complex than compilers, and probably I'm in search of something a bit more challenging - as hobby? yep, why not... :-) – Dr.Kameleon Mar 01 '14 at 01:36
  • Not sure who's the "luckier" there... Either way, I picked Pascal because it's pretty easy to parse. Other languages aren't all easy... – Mats Petersson Mar 01 '14 at 01:37
  • http://www.youtube.com/watch?v=SEOJKg_Wfhc - and I rest my case. lol. Have a nice day, my friend. Greetings from some thousand miles away! :-) – Dr.Kameleon Mar 01 '14 at 01:42
  • But note that Pascal was designed carefully to be easy to parse (and, in general, process) top-down, start to finish. No wonder it was easy to write a parser by hand. – vonbrand Mar 04 '14 at 00:52