29

According to Vala documentation: "Before 0.3.1, Vala's parser was the classic flex scanner and Bison LALR parser combination. But as of commit eba85a, the parser is a hand-crafted recursive descent parser." My question is: Why?

The question could be addressed to any compiler which isn't using parser generator. What are pros and cons for such a move from parser generator to hand-crafted parser? What are disadvantages of using parser generators (Bison, ANTLR) for compilers?

As a side comment: I'm interested in Vala specifically because I like the idea of having language with modern features and clean syntax but compilable into "native" and "unmanaged" high-level language (C in case of Vala). I have found only Vala so far. I'm thinking of having fun by making Vala (or similar language) compilable to C++ (backed by Qt libs). But since I don't want to invent completely new language I'm thinking of taking some existing grammar. Obviously hand-crafted parsers don't have written formal grammar I might reuse. Your comments on this idea are welcome (is the whole idea silly?).

vladimir
  • 2,635
  • 3
  • 23
  • 26
  • 2
    Have you asked Jürg Billeter (author of the commit)? – Sam Harwell Mar 28 '13 at 02:36
  • Hmmm, no, I haven't. I will try to reach him. I'm changing Title of my question for making it more general. – vladimir Mar 28 '13 at 02:42
  • Likely hand-crafting a parser can be made to be faster/more space efficient, since it doesn't have to be generic and may be able to use more specific tricks. – Patashu Mar 28 '13 at 02:54
  • 2
    You may want to look at Mozilla's Rust project, which already targets C++. – apmasell Mar 28 '13 at 03:33
  • Error messages and error recovery are handled much better in a handwritten parser. Of course there are nice ways of generating such code from some high level declarative language, but these modern ways are unknown (or too scary) to the yacc-infected LR folks. They would rather prefer to write a recursive descent parser manually then go into something PEG-based. – SK-logic Mar 28 '13 at 09:12
  • Also see http://mortoray.com/2012/07/20/why-i-dont-use-a-parser-generator for some more issues with generators. – Trass3r May 20 '14 at 15:06
  • One advantage with parser generator is its ability to target any programming language. For example, one can generate lexer/parser in Python as well as in Java using Antlr. This is definitely an important lookout for use cases wherein you're targeting multiple programming communities. – asyncwait Jan 08 '17 at 14:09

5 Answers5

29

I have written half a dozen hand crafted parsers (in most cases recursive descent parser AKA top-down parser) in my career and have seen parsers generated by parser generators and I must admit I am biased against parser generators.

Here are some pros and cons for each approach.

Parser generators

Pros:

  • Quickly get a working parser (at least if you do not know how to hand code it).

Cons:

  • Generated code is hard to understand and debug.
  • Difficult to implement proper error handling. The generator will create a correct parser for syntactically correct code but will choke on incorrect code and in most cases will not be able to provide proper error messages.
  • A bug in parser generator may halt your project. You need to fix the bug in somebody else's code (if source code is available), wait for the author to fix it or workaround the bug (if possible at all).

Hand crafted recursive descent parser

Pros:

  • Generated code is easy to understand. Recursive parsers usually have one function corresponding to each language construct, e.g. parseWhile to parse a 'while' statement, parseDeclaration to parse a declaration and so on. Understanding and debugging the parser is easy.
  • It is easy to provide meaningful error messages, to recover from errors and continue parsing in the way that makes most sense in a particular situation.

Cons:

  • It will take some time to hand code the parser especially if you do not have an experience with this stuff.

  • The parser may be somewhat slow. This applies to all recursive parsers not just hand written ones. Having one function corresponding to each language construct to parse a simple numeric literal the parser may make a dozen or more nested calls starting from e.g. parseExpression through parseAddition, parseMultiplication, etc. parseLiteral. Function calls are relatively inexpensive in a language like C but still cam sum up to a significant time.

One solution to speedup a recursive parser is to replace parts of your recursive parser by a bottom-up sub-parser which often is much faster. The natural candidates for such sub-parser are the expressions which have almost uniform syntax (i.e. binary and unary expressions) with several precedence levels. The bottom-up parser for an expression is usually also simple to hand code, it is often just one loop getting input tokens from the lexer, a stack of values and a lookup table of operator precedence’s for operator tokens.

TN.
  • 611
  • 1
  • 4
  • 9
19

LR(1) and LALR(1) parsers are really, really annoying for two reasons:

  1. The parser generator isn't very good at producing useful error messages.
  2. Certain kinds of ambiguity, like C-style if-else blocks, make writing the grammar very painful.

On the other hand, LL(1) grammar are much better at both of these things. The structure of LL(1) grammars makes them very easy to encode as recursive descent parsers, and so dealing with a parser generator is not really a win.

Also, in the case of Vala, the parser and the compiler itself as presented as a library, so you can build a custom backend for the Vala compiler using the Vala compiler library and get all the parsing and type checking and such for free.

apmasell
  • 7,033
  • 19
  • 28
  • +1 I would expect errorhandling to be the primary cause. Ambiguities and repeated rule writing can sometimes also get handled by extensions in the parser generator (bison does some) – Marco van de Voort Mar 28 '13 at 22:04
2

I know this isn't going to be definitive, and if your questions weren't specifically Vala-related I wouldn't bother, but since they are...

I wasn't too heavily involved with the project back then so I'm not that clear on some of the details, but the big reason I remember from when Vala switched was dogfooding. I'm not certain it was the primary motivation, but I do remember that it was a factor.

Maintainability was also an issue. That patch replaced a larger parser written in C/Bison/YACC (relatively few people have significant experience with the latter two) with a smaller parser in Vala (which pretty much anyone interested in working on valac probably knows and is comfortable with).

Better error reporting was also a goal, IIRC.

I don't know if it was a factor at all, but the hand-written parser is a recursive descent parser. I know ANTLR generates those, the ANTLR is written in Java, which is a pretty heavy dependency (yes, I know it's not a run-time dependency, but still).

As a side comment: I'm interested in Vala specifically because I like the idea of having language with modern features and clean syntax but compilable into "native" and "unmanaged" high-level language (C in case of Vala). I have found only Vala so far. I'm thinking of having fun by making Vala (or similar language) compilable to C++ (backed by Qt libs). But since I don't want to invent completely new language I'm thinking of taking some existing grammar. Obviously hand-crafted parsers don't have written formal grammar I might reuse. Your comments on this idea are welcome (is the whole idea silly?).

A lot of Vala is really a reflection of decisions made by GObject, and things may or may not work the same way in C++/Qt. If your goal is to replace GObject/C in valac with Qt/C++, you're probably in for more work than you expect. If, however, you just want to make C++ and Qt libraries accessible from Vala, that should certainly be possible. In fact, Luca Bruno started working on that about a year ago (see the wip/cpp branch). It hasn't seen activity for a while due to lack of time, not technical issues.

skagedal
  • 2,323
  • 23
  • 34
nemequ
  • 16,623
  • 1
  • 43
  • 62
  • Awesome! Thanks for the inside. I'll learn more before deciding how to make Qt libs available in Vala. – vladimir Mar 29 '13 at 05:32
0

According to Vala documentation: "Before 0.3.1, Vala's parser was the classic flex scanner and Bison LALR parser combination. But as of Commit eba85a, the parser is a hand-crafted recursive descent parser." My question is: Why?

Here I'm specifically asking about Vala, although the question could be addressed to any compiler which isn't using parser generator. What are pros and cons for such a move from parser generator to hand-crafted parser? What are disadvantages of using parser generators (Bison, ANTLR) for compilers?

Perhaps the programmers spotted some avenues for optimisation that the parser generator didn't spot, and these avenues for optimisation required an entirely different parsing algorithm. Alternatively, perhaps the parser generator generated code in C89, and the programmers decided refactoring for C99 or C11 would improve legibility.

As a side comment: I'm interested in Vala specifically because I like the idea of having language with modern features and clean syntax but compilable into "native" and "unmanaged" high-level language (C in case of Vala).

Just a quick note: C isn't native. It has it's roots in portability, as it was designed to abstract away those hardware/OS-specific details that caused programmers so much grief when porting. For example, think of the pain of using an entirely different fopen for each OS and/or filesystem; I don't just mean different in functionality, but also different in input and output expectations eg. different arguments, different return values. Similarly, C11 introduces portable threads; Code that uses threads will be able to use the same C11-compliant code to target all OSes that implement threads.

I have found Vala so far. I'm thinking of having fun by making Vala (or similar language) compilable to C++ (backed by Qt libs). But since I don't want to invent completely new language I'm thinking of taking some existing grammar. Obviously hand-crafted parsers don't have written formal grammar I might reuse. Your comments on this idea are welcome (is the whole idea silly?).

It may be feasible to use the hand-crafted parser to produce C++ code with little effort, so I wouldn't toss this option aside so quickly; The old flex/bison parser generator may or may not be more useful. However, this isn't your only option. In any case, I'd be tempted to study the specification extensively.

autistic
  • 1
  • 3
  • 35
  • 80
  • Thank you. By "native" I mean "directly compiled to native code", meaning no byte code in a middle, no JIT-compilation, no VM, therefore no GC. – vladimir Mar 28 '13 at 03:10
  • @vladimir What, like [this](http://www.softintegration.com/) or [this](http://root.cern.ch/drupal/content/cint)? By visiting those links you may have gained insightful information: That C may also be interpreted. It might also be insightful to learn that C#, a typically JIT-compiled-to-byte-code language, can also be compiled to machine code and that JIT is an optimisation method that also applies to *native machine code*. You might be interested to learn that C is specified in terms of a VM called the "abstract machine", and that the C specifications don't exclude garbage collection. – autistic Mar 28 '13 at 07:38
  • @vladimir: Javascript was traditionally interpreted, but is now compiled by the V8 engine to IA-32, x86-64, ARM, or MIPS. Release the connection you have established between "translation method" (eg. interpretation, compilation) and "programming language". Code written in one programming language can be compiled (translated) to any other turing complete programming language, or interpreted (translated directly to behaviour). – autistic Mar 28 '13 at 07:42
  • I have to rephrase the part about "native" and "unmanaged". Let's say a language which is cheap at runtime. C# or Java program with empty GUI takes 40Mb at least. Also the GUI might be slow. For example, Qt app for monitoring high-frequency trading took 10 times less memory and was able to update numbers on a screen 4-5 times faster then WPF. And this is not because of me. I'm good in C# and WPF (much better then in Qt). So, what I really meant under "native" and "unmanaged" is "without heavy runtime". Such as C++ compiled by GCC. Although somewhy I don't want to go back to C++ syntax. – vladimir Mar 28 '13 at 12:13
  • @vladimir Which part of the [C# language specification](http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-334.pdf) states that GUI is a requirement? WPF is part of the .NET framework, which also consists of the CLR. Only a few parts of the CLR are required in a valid C# implementation. Furthermore, optimisation is another detail of implementation that is irrelevant to specification. Languages don't have speed; That's an attribute that compilers/interpreters produce... and a C# program that produces the same results as a C program may very well turn into the same machine code. – autistic Mar 28 '13 at 12:45
  • as a programmer I can't consider language without it's compiler and runtime environment and libraries. Regardless of what specification says, boss says we need fast UI application for high-frequency trading. I'm responsible for technology selection. Let's say, I like C# very much (I really do), but will currently EXISTING compilers and libraries fit the current task? So liking or not liking syntax is disconnected from performance of concrete compiler, runtime an libraries. However the whole platform selection fully depends on all these and syntax is the last by the way. – vladimir Mar 28 '13 at 15:11
  • So my dream is C#-like syntax, C(compiled by GCC)-like performance and libraries selection like in Java (or C++). Seems like compiling C# somehow to C/C++ is what I need. Because then I'm getting syntax I like, potentially I can get portability, good performance and thin runtime (with GCC compiler) and some selection of C++ libraries which could be mapped without wrappers to the language. – vladimir Mar 28 '13 at 15:18
  • @vladimir Is the GUI the most significant bottleneck in your application? Which parts of it, specifically? Use a profiler to determine the answer to those two questions. Is it more feasible to use an alternative to that/those specific controls, or to repopulating the controls with values, than to redesign the entire UI in Qt? Research alternative methods of population such as "data binding", or functions such as "BeginUpdate" and "EndUpdate" which might provide performance boosts for large updates. If you can't find anything, look for alternative controls. Is it possible to integrate Qt in C#? – autistic Mar 28 '13 at 15:23
  • @vladimir If the GUI isn't the most significant bottleneck in your application, then why are you focusing on it? Using a "faster programming language" won't *significantly* improve the performance of your solution; You need to look at your algorithm to do that. Whatever your trading program is doing behind the GUI, or to update the GUI, is taking longer than it should... so focus on finding alternatives for that. – autistic Mar 28 '13 at 15:35
  • I know all techniques of optimization mentioned by you. And I'm using them. Also I tried to implement an app on Qt. And figured out that it's faster and less memory consuming. The difference is so big that I started to think how I can get such technology with C# (MS compiler and runtime). Wrapping Qt libs to C# would help but not that much. Because I see that heaviness of the .Net process related not only to the libraries but it also because of heavy runtime. That is how I came to the idea of compiling C# to something different then CIL. – vladimir Mar 28 '13 at 15:47
  • please, don't treat me like I don't know what I'm doing. I'm working on automated market making system for options on the trading floor. My client application is subscribing to low-latency server. Yes - the GUI is a bottleneck of the client application. GUI is the only thing there besides very effective trasnsport library. The .NET runtime just on a start eats more memory then the whole Qt based app. Is'n it weakness of runtime? Not even mentioning the GUI WPF library. – vladimir Mar 28 '13 at 16:03
  • @vladimir You know all of this and yet you don't know why a recursive descent parser would be preferable? ... and yet you're messing around trying to translate Vala to C++ instead of C because you want the Qt toolkit? Well, sorry... I can't help you. – autistic Mar 28 '13 at 16:32
  • Well, I'm not sure to where I want to translate Vala, I'm not even sure if it should be Vala. I just found that Vala implements the idea very similiar to mine and that the idea is good to me. – vladimir Mar 28 '13 at 16:52
  • Not knowing aspects of compiler optimization (moving from parser gen to hand-crafted parser) only means that I don't have hands-on experience in that area. From what I said you can imagine that i have some other unique knowledge in programming area. Finally if that would be that simple there would be no compilers using parser generators and parser generators themselves. Knowing or no knowing things doesn't give a right of speaking arrogantly. Please, respect other opinion and give real arguments instead of "sorry... I can't help you". – vladimir Mar 28 '13 at 17:03
0

It is curious that these authors went from bison to RD. Most people would go in the opposite direction.

The only real reason I can see to do what the Vala authors did would be better error recovery, or maybe their grammar isn't very clean.

I think you will find that most new languages start out with hand-written parsers as the authors get a feel for their own new language and figure out exactly what it is they want to do. Also in some cases as the authors learn how to write compilers. C is a classic example, as is C++. Later on in the evolution a generated parser may be substituted. Compilers for existing standard languages on the other hand can more rapidly be developed via a parser generator, possibly even via an existing grammar: time to market is a critical business parameter of these projects.

user207421
  • 305,947
  • 44
  • 307
  • 483
  • 5
    GCC went the other way, too. They replaced Bison with a hand-written recursive descent parser for C++ in 3.4, and for C and Objective C in 4.1. I'm not trying to make a point or anything, just thought you might find it interesting. – nemequ Mar 29 '13 at 01:40