3

I'm looking for a context-free grammar parser generator with grammar/code separation and the possibility of adding support for new target languages. For instance, if I want a parser in Pascal, I can write my own Pascal code generator without reimplementing the whole thing.

I understand that most open-source parser generators can in theory be extended; still, I'd prefer something that has extensibility planned and documented.

Feature-wise, I need the parser to at least support Python-style indentation, maybe with some additional work. I have no requirement on the type of parser generated, but I'd prefer something fast.

Which are the most well-known/maintained options?

Popular parser generators seem mostly to use the mixed grammar/code approach, which I really don't like. The comparison list on Wikipedia includes quite a few, but I'm a novice at this and can't tell which to try.

Why I don't like mixing grammar and code: the approach seems like a mess. Grammar is grammar, and implementation details are implementation details. They're different things, written in different languages, so it's intuitive to keep them in separate places.

What if I want to reuse parts of the grammar in another project with different implementation details? What if I want to compile a parser in a different language? All of this requires the grammar to be kept separate.
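For illustration, here is a minimal sketch of the separation I mean, using Python's Lark library as just one example of such a tool (the grammar and rule names here are purely illustrative):

```python
# A sketch of grammar/code separation using the Lark library
# (pip install lark). The grammar is pure declaration; the
# semantic actions live in a separate Transformer class.
from lark import Lark, Transformer

GRAMMAR = r"""
    ?start: expr
    ?expr: expr "+" term   -> add
         | term
    term: INT
    %import common.INT
    %import common.WS
    %ignore WS
"""

class Eval(Transformer):
    """Implementation details, kept entirely outside the grammar."""
    def add(self, children):
        left, right = children
        return left + right

    def term(self, children):
        (token,) = children
        return int(token)

parser = Lark(GRAMMAR, parser="lalr")
print(Eval().transform(parser.parse("1 + 2 + 3")))  # -> 6
```

The same GRAMMAR string could then be fed to a different Transformer, or to a (hypothetical) Pascal back-end, without touching the rules.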

himself
  • 4,806
  • 2
  • 27
  • 43
  • Why don't you like mixed grammar/code? Do you want to build a Concrete Syntax Tree and then separately transform that into an Abstract Syntax Tree, for example? – Robin Green May 04 '11 at 18:24
  • I never understood the objection against mixing parser rules and semantic actions. Creating a parse tree and then traversing it is much slower, and it doesn't allow for syntax flexibility (e.g. simplest case: match a number only if it satisfies a given predicate or lambda). I also don't see the use of a separate parser that outputs in many languages. You always need just one, and the most efficient one. – Gene Bushuyev May 31 '11 at 01:28
  • @Gene Bushuyev: I would be satisfied with calling virtual functions and letting me do the job in a parser descendant (see the sketch after these comments). This is almost as efficient as mixing parsing/implementation, but much cleaner and, uhm, more logical. Rules and how exactly they are implemented are different things, aren't they? – himself May 31 '11 at 12:26
  • As for many languages: I need just the one I need to make a parser for. And if you're not using C++ or Java, options are suddenly very limited. You're basically stuck with whatever one or two parser generators are available for your language, no matter how unfortunate they are. That's why I want something extensible. I'm okay with writing a back-end on my own. – himself May 31 '11 at 12:30
  • Nope, I don't agree with these assertions. Parsers using virtual functions are usually very slow, because many syntax rules are very short and checked very often. I wouldn't even go there. Second, I have never seen a parser for the same syntax reused with different semantics. Even in the rare case when it's feasible, it's terribly inefficient and actually results in convoluted and error-prone code. Adding semantic actions to the syntax rules makes code cleaner, easier to understand, easier to debug, and comes with the best possible performance. And I have done many parsers both ways. – Gene Bushuyev Jun 02 '11 at 21:32
  • Sorry to comment months later. I have made parsers in Pascal (Object/Free/Delphi) and a scanner generator in Pascal. What I mean is that it's possible to make compiler tools in the modern dialects of Pascal ... – umlcat Dec 09 '13 at 22:33
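For reference, a minimal sketch of the "virtual functions in a parser descendant" idea from the comments above, in Python for brevity (all class and method names are hypothetical):

```python
# The base parser knows the grammar; descendants supply the semantics
# by overriding hook methods. All names here are hypothetical.

class ListParser:
    """Parses a comma-separated list of integers, e.g. "1,2,3".
    Grammar:  list -> INT ("," INT)*"""

    def parse(self, text):
        result = self.on_list_start()
        for part in text.split(","):
            result = self.on_item(result, int(part))
        return result

    # "Virtual" hooks: the rules live here, their meaning elsewhere.
    def on_list_start(self):
        raise NotImplementedError

    def on_item(self, acc, value):
        raise NotImplementedError


class SumParser(ListParser):
    """Same grammar, one choice of semantics: sum the items."""
    def on_list_start(self):
        return 0

    def on_item(self, acc, value):
        return acc + value


class TreeParser(ListParser):
    """Same grammar, different semantics: build a tree."""
    def on_list_start(self):
        return []

    def on_item(self, acc, value):
        return acc + [("int", value)]


print(SumParser().parse("1,2,3"))   # -> 6
print(TreeParser().parse("1,2,3"))  # -> [('int', 1), ('int', 2), ('int', 3)]
```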

2 Answers

3

Most parser generators won't handle arbitrary context-free grammars. They handle some subset (LL(1), LL(k), LL(*), LALR(1), LR(k), ...). If you choose one of these, you will almost certainly have to hack your grammar to match the limitations of the parser generator (no left recursion, limited lookahead, ...). If you want a real context-free parser generator, you want an Earley parser generator (inefficient), a GLR parser generator (the most practical of the lot), or a PEG parser generator (and the last isn't truly context-free; it requires rules to be ordered to determine which ones take precedence).
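For example, here is a sketch (my own, with illustrative rule names) of the classic hack LL tools force on you: the natural left-recursive rule can't be coded as a recursive-descent procedure, so you rewrite it and then restore left-associativity by hand:

```python
# The left-recursive rule   expr : expr "+" term | term
# cannot be an LL/recursive-descent procedure: parse_expr() would
# call itself before consuming any input. The standard rewrite is
#                            expr : term ("+" term)*
# after which the left-leaning tree must be rebuilt manually:

def parse_expr(tokens, pos=0):
    """expr : term ("+" term)*   -- tokens is a list like [1, '+', 2]."""
    value, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        value = ("add", value, right)   # re-associate to the left by hand
    return value, pos

def parse_term(tokens, pos):
    """term : INT"""
    return ("int", tokens[pos]), pos + 1

print(parse_expr([1, "+", 2, "+", 3])[0])
# -> ('add', ('add', ('int', 1), ('int', 2)), ('int', 3))
```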

You seem to be worried about mixing the syntax with the parser actions used to build the trees.

If the tree you build isn't a direct function of the syntax, there has to be some way to tie the tree-building machinery to the grammar productions. Placing it "near" the grammar production is one way, but leads to your "mixed" notation objection.

Another way is to give each rule a name (or some unique identifier), and set the tree-building machinery off to the side, indexed by the names. This way your grammar isn't contaminated with the "other stuff", which seems to be your objection. None of the parser generator systems I know of do this. An awkward issue is that you now have to invent lots of rule names, and once you have a few hundred of them, that's inconvenient by itself, and it is hard to keep them mnemonic.
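Here is a hedged sketch of what that indexing scheme could look like (the rule names and the reduction hook are hypothetical, not any particular tool's API):

```python
# "Tree builders off to the side, indexed by rule name". The parser
# itself (not shown) is assumed to report each reduction as
# (rule_name, children). All names here are hypothetical.

def build_add(children):
    left, _plus, right = children
    return ("add", left, right)

def build_number(children):
    (token,) = children
    return ("num", int(token))

# The grammar/semantics association lives here, not in the grammar text.
BUILDERS = {
    "expr_add":  build_add,       # expr : expr "+" term
    "expr_term": lambda c: c[0],  # expr : term
    "term_num":  build_number,    # term : NUMBER
}

def on_reduce(rule_name, children):
    """Called by the parser at each reduction."""
    return BUILDERS[rule_name](children)

# Simulate the reductions a parser would perform for "1 + 2":
n1 = on_reduce("term_num", ["1"])
n2 = on_reduce("term_num", ["2"])
e1 = on_reduce("expr_term", [n1])
print(on_reduce("expr_add", [e1, "+", n2]))
# -> ('add', ('num', 1), ('num', 2))
```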

A third way is to make the tree a direct function of the syntax, and auto-generate the tree-building steps. This requires no extra stuff off to the side at all to produce the ASTs. The only tool I know that does it (there may be others, but I've been looking for 20-odd years and haven't seen one) is my company's product, the DMS Software Reengineering Toolkit. [DMS isn't just a parser generator; it is a complete ecosystem for building program analysis and transformation tools for arbitrary languages, using a GLR parsing engine; yes, it handles Python-style indents.]

One objection is that such trees are concrete, bloated and confusing; if done right, that's not true. My SO answer to the question What is the difference between an Abstract Syntax Tree and a Concrete Syntax Tree? discusses how we get the benefits of ASTs from automatically generated, compressed CSTs.

The good news about DMS's scheme is that the basic grammar isn't bloated with parsing support. The not-so-good news is that you will find lots of other things you want to associate with grammar rules (prettyprinting rules, attribute computations, tree synthesis, ...), and you come right back around to the same choices. DMS has all of these "other things" and solves the association problem in a number of ways:

  • By placing other related descriptive formalisms next to the grammar rule (producing the mixing you complained about). We tolerate this for pretty-printing rules because in fact it is nice to have the grammar (parse) rule adjacent to the pretty-print (anti-parse) rule. We also allow attribute computations to be placed near the grammar rules to provide an association.

  • While DMS allows rules to have names, this is only for convenient access by procedural code, not associating other mechanisms with the rule.

  • DMS provides a third way to associate these mechanisms (especially attribute-grammar computations): using the rule itself as a kind of giant name. So you write the grammar and prettyprint rules in one place, and somewhere else you can write the grammar rule again with an associated attribute computation. In principle, this is just like giving each rule a name (well, a signature) and associating the computation with the name. But it also allows us to define many, many different attribute computations (for different purposes) and associate them with their rules, without cluttering up the base grammar. Our tools check that each (rule, associated computation) pair has a valid rule in the base grammar, which makes it relatively easy to track down what needs fixing when the base grammar changes. (A language-neutral sketch of the idea follows this list.)
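A language-neutral sketch of that "rule text as key" idea (this is not DMS syntax; the rule strings and table names are hypothetical):

```python
# Several independent attribute tables are keyed by the rule's own
# signature and validated against the base grammar. Not DMS syntax;
# all names are hypothetical.

BASE_GRAMMAR = {
    'expr : expr "+" term',
    'expr : term',
    'term : NUMBER',
}

# One set of computations, e.g. for evaluation...
EVAL = {
    'expr : expr "+" term': lambda kids: kids[0] + kids[2],
    'term : NUMBER':        lambda kids: int(kids[0]),
}

# ...and a completely separate set, e.g. for pretty-printing,
# without touching the base grammar at all.
PRETTY = {
    'expr : expr "+" term': lambda kids: f"{kids[0]} + {kids[2]}",
}

def check_table(table, grammar):
    """Flag computations whose rule no longer exists in the grammar,
    so grammar changes are easy to track down."""
    for rule in table:
        if rule not in grammar:
            raise KeyError(f"no such rule in base grammar: {rule}")

check_table(EVAL, BASE_GRAMMAR)
check_table(PRETTY, BASE_GRAMMAR)
```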

This being my tool (I'm the architect), you shouldn't take this as a recommendation, just a bias. That bias is supported by DMS's ability to parse (without whimpering) C, C++, Java, C#, IBM Enterprise COBOL, Python, F77/F90/F95 (with column-6 continuations, F90 continuations, and embedded C preprocessor directives to boot, under most circumstances), Mumps, PHP4/5, and many other languages.

Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
  • Note that LL(K) and LR(K) are generally good enough for most grammars, if you have control over how big the K is. – T.E.D. May 05 '11 at 13:01
  • @T.E.D.: They won't handle ambiguous grammars or those requiring infinite lookahead. LL(K) won't handle left recursion. You can't find many LR(k) parser generators. GLR wins as a practical answer. – Ira Baxter May 11 '11 at 04:35
  • @Ira Baxter - That's more of a theoretical issue than a real-world one, unless you are trying to parse a human language or something. Most programming languages can be handled just fine by an LL(k). Heck, most can be handled with a wee bit of work by YACC, and it's only LR(1). – T.E.D. May 11 '11 at 15:33
  • @T.E.D.: Wee bit of work? You can make any parser generator process pretty much any language. The way you do it is to let the generated parser "accept too much" and add post-processing to eliminate what the parser generator could not. While you can make this work, it is often a *lot* more work than you might expect. GCC famously gave up parsing C++ with Bison (YACC). The ANTLR guys (LL(inf?)) don't have a working C++ parser. ... – Ira Baxter May 11 '11 at 16:00
  • @T.E.D: ... Try parsing FORTRAN with any of these and getting the loop structures (with shared loop-termination statements) right, or parsing COBOL and getting the data-declaration nesting right (COBOL numbers the nesting depth of structure elements rather than using brackets). So, yes, you can do it. What GLR brings to the table is the possibility of doing it well with significantly less effort. DMS with its huge variety of front ends is a perfect example of that. DMS has a lot of non-GLR machinery to help, so GLR isn't the complete answer. ... – Ira Baxter May 11 '11 at 16:03
  • @T.E.D: ... But I've been looking for/using parser generators for most of my 40-year career, and GLR has been by far my best choice. YMMV, but I'd be very surprised. – Ira Baxter May 11 '11 at 16:04
1

First off, any decent parser generator is going to be robust enough to support Python's indenting. That isn't really all that weird as languages go. You should try parsing column-sensitive languages like Fortran77 some time...
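The standard trick is to handle the indentation in the lexer, which emits synthetic INDENT/DEDENT tokens so the grammar proper can stay context-free (roughly what Python's own tokenizer does). A minimal sketch, with error handling and tab expansion omitted:

```python
# Off-side-rule lexing: convert changes in leading whitespace into
# INDENT/DEDENT tokens, so the parser sees block delimiters just as
# if the language used braces.

def indent_tokens(source):
    stack = [0]                       # currently open indentation levels
    for line in source.splitlines():
        if not line.strip():
            continue                  # blank lines don't change nesting
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            yield ("INDENT", width)
        while width < stack[-1]:
            stack.pop()
            yield ("DEDENT", width)
        yield ("LINE", line.strip())
    while len(stack) > 1:             # close blocks still open at EOF
        stack.pop()
        yield ("DEDENT", 0)

src = "if x:\n    f(x)\n    if y:\n        g(y)\nh()\n"
for token in indent_tokens(src):
    print(token)
# ('LINE', 'if x:'), ('INDENT', 4), ('LINE', 'f(x)'), ('LINE', 'if y:'),
# ('INDENT', 8), ('LINE', 'g(y)'), ('DEDENT', 0), ('DEDENT', 0), ('LINE', 'h()')
```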

Secondly, I don't think you really need the parser itself to be "extensible" do you? You just want to be able to use it to lex and parse the language or two you have in mind, right? Again, any decent parser-generator can do that.

Thirdly, you don't really say what about the mix between grammar and code you don't like. Would you rather it be all implemented in a meta-language (kinda tough), or all in code?

Assuming it is the latter, there are a couple of in-language parser generator toolkits I know of. The first is Boost's Spirit, which is implemented in C++. I've used it, and it works. However, back when I used it you pretty much needed a graduate degree in "boostology" to be able to understand its error messages well enough to get anything working in a reasonable amount of time.

The other I know about is OpenToken, which is a parser-generation toolkit implemented in Ada. Ada doesn't have the error-novel problem that C++ has with its templates, so OpenToken is far easier to use. However, you have to use it in Ada...

Typical functional languages allow you to implement any sublanguage you like (mostly) within the language itself, thanks to their inherently good support for things like lambdas and metaprogramming. However, their parsers tend to be slower. That's really no problem at all if you are just parsing a configuration file or two. It's a tremendous problem if you are parsing hundreds of files at a go.
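To give a feel for that in-language style, here is a hedged, minimal parser-combinator sketch, written in Python for continuity (real combinator libraries such as Haskell's Parsec are far more complete):

```python
# A parser here is just a function: text -> (value, remaining_text),
# or None on failure. Combinators build bigger parsers from smaller ones.
import re

def tok(pattern):
    """Parser for a regex matched at the start of the input."""
    rx = re.compile(pattern)
    def parse(text):
        m = rx.match(text)
        return (m.group(), text[m.end():]) if m else None
    return parse

def seq(p, q, combine):
    """Parser that runs p, then q, combining the two results."""
    def parse(text):
        r1 = p(text)
        if r1 is None:
            return None
        v1, rest = r1
        r2 = q(rest)
        if r2 is None:
            return None
        v2, rest = r2
        return combine(v1, v2), rest
    return parse

number = tok(r"[0-9]+")
plus_number = seq(tok(r"\+"), number, lambda _op, n: int(n))

def expr(text):
    """expr : number ("+" number)*  -- folds the sum as it parses."""
    r = number(text)
    if r is None:
        return None
    value, rest = int(r[0]), r[1]
    while (r := plus_number(rest)) is not None:
        n, rest = r
        value += n
    return value, rest

print(expr("1+2+3"))  # -> (6, '')
```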

T.E.D.
  • 44,016
  • 10
  • 73
  • 134