3

I'm working on a compiler (for a language close to C) and I have to implement it in C. My main question is how to choose the right parsing method so that I can be efficient while coding my compiler.

Here's my current grammar: http://img11.hostingpics.net/pics/273965Capturedcran20130417192526.png

I was thinking about making a top-down parser LL(1) as described here: http://dragonbook.stanford.edu/lecture-notes/Stanford-CS143/07-Top-Down-Parsing.pdf

Would it be an efficient choice for this grammar, given that I first have to remove the left-recursive rules? Do you have any other advice?

Thank you, Mentinet

mentinet
  • Do you need to be efficient in terms of coding your parser, or would you like the code to be efficient in terms of parsing things? In my nearly two-decade-long experience of writing things that require parsers, it has never been the case that the speed of parsing was important to a measurable degree. The things I did with the parsed AST have always been the key factor in determining the performance. – Sergey Kalinichenko Apr 17 '13 at 17:38
  • The most important thing for me is to be efficient in terms of coding, because we (I'm working in a pair) have already done lots of work in Clean, but we are not experienced with that language and it's really hard to go further now. So we are a bit late, and the quicker we have something that works, the better. We'll also then have to build an AST. – mentinet Apr 17 '13 at 17:55
  • I would use ANTLR then: it builds very good recursive descent parsers, has a C target, and is very easy to learn. – Sergey Kalinichenko Apr 17 '13 at 18:02
  • The thing is that we can use libraries, but they mustn't do all the job for us. So I think I'm going to try to implement a recursive-descent parser. – mentinet Apr 18 '13 at 07:59
  • Implement your own Packrat parsing: it is pretty trivial and very easy to use - PEGs are nearly intuitive. – SK-logic Apr 18 '13 at 09:13
  • I second using PEG/Packrat. It just makes everything so much easier, so most of your time can be spent on code interpretation/generation instead of fighting the grammar and the parser generator. I would understand that you are required to have the generated parser be in C, but not the parser generator. If that's the case, you can take a look at PegJS, PyParsing, or Grako. If you must use C, then there's [Rats](http://cs.nyu.edu/rgrimm/xtc/). – Apalala Apr 18 '13 at 18:53

4 Answers

3

Lots of answers here, but they get things confused. Yes, there are LL and LR parsers, but that isn't really your choice.

You have a grammar. There are tools that automatically create a parser for you, given a grammar. The venerable Yacc and Bison do this. They create an LR parser (LALR, actually). There are also tools that create an LL parser for you, like ANTLR. The downside of tools like this is that they are inflexible. Their automatically generated syntax error messages suck, error recovery is hard, and the older ones encourage you to factor your code in one way, which happens to be the wrong way. The right way is to have your parser spit out an Abstract Syntax Tree, and then have the compiler generate code from that. The tools want you to mix parser and compiler code.

When you are using automated tools like this, the differences in power between LL, LR and LALR really do matter. You can't "cheat" to extend their power. (Power in this case means being able to generate a parser for a valid context-free grammar. A valid context-free grammar is one that generates a unique, correct parse tree for every input, or correctly says it doesn't match the grammar.) We currently have no parser generator that can create a parser for every valid grammar. However, LR can handle more grammars than any other sort. Not being able to handle a grammar isn't a disaster, as you can rewrite the grammar in a form the parser generator can accept. However, it isn't always obvious how that should be done, and worse, it affects the Abstract Syntax Tree generated, which means weaknesses in the parser ripple down through the rest of your code, like the compiler.

The reason there are LL, LALR and LR parsers is that, a long time ago, the job of generating an LR parser was taxing for the computers of the day, both in terms of time and memory. (Note this is the time it takes to generate the parser, which only happens when you write it. The generated parser runs very quickly.) But that was a long time ago. Generating an LR(1) parser takes far less than 1GB of RAM for a moderately complex language, and on a modern computer it takes under a second. For that reason you are far better off with an LR automatic parser generator, like Hyacc.

The other option is to write your own parser. In this case there is only one choice: an LL parser. When people here say writing LR is hard, they understate the case. It is near impossible for a human to manually create an LR parser. You might think this means that if you write your own parser, you are constrained to LL(1) grammars. But that isn't quite true. Since you are writing the code, you can cheat. You can look ahead an arbitrary number of symbols, and because you don't have to output anything until you are good and ready, the Abstract Syntax Tree doesn't have to match the grammar you are using. This ability to cheat makes up for all the power lost between LL and LR(1), and often then some.

Hand-written parsers have their downsides, of course. There is no guarantee that your parser actually matches your grammar, and for that matter no check that your grammar is valid (i.e. recognises the language you think it does). They are longer, and they are even worse at encouraging you to mix parsing code with compiler code. They are also obviously implemented in only one language, whereas parser generators often spit out their results in several different languages. Even if they don't, an LR parse table can be represented in a data structure containing only constants (say, in JSON), and the actual parser is only 100 lines of code or so. But there are also upsides to a hand-written parser. Because you wrote the code, you know what is going on, so it is easier to do error recovery and generate sane error messages.

In the end, the tradeoff often works like this:

  • For one-off jobs, you are far better off using an LR(1) parser generator. The generator will check your grammar, save you work, and modern ones can spit out the Abstract Syntax Tree directly, which is exactly what you want.
  • For highly polished tools like mcc or gcc, use a hand-written LL parser. You will be writing lots of unit tests to guard your back anyway, error recovery and error messages are much easier to get right, and, by cheating, they can recognise a larger class of languages.

The only other question I have is: why C? Compilers aren't generally time-critical code. There are very nice parsing packages out there that will let you get the job done in half the code, if you're willing to have your compiler run a fair bit slower; my own Lrparsing, for instance. Bear in mind that "a fair bit slower" here means "hardly noticeable to a human". I guess the answer is "the assignment I am working on specifies C". To give you an idea of how simple getting from your grammar to a parse tree becomes when you relax that requirement, this program:

#!/usr/bin/python

from lrparsing import *

class G(Grammar):
  Exp = Ref("Exp")
  int = Token(re='[0-9]+')
  id = Token(re='[a-zA-Z][a-zA-Z0-9_]*')
  ActArgs = List(Exp, ',', 1)
  FunCall = id + '(' + Opt(ActArgs) + ')'
  Exp = Prio(
      id | int | Tokens("[]", "False True") | Token('(') + List(THIS, ',', 1, 2) + ')' |
      Token("! -") + THIS,
      THIS << Tokens("* / %") << THIS,
      THIS << Tokens("+ -") << THIS,
      THIS << Tokens("== < > <= >= !=") << THIS,
      THIS << Tokens("&&") << THIS,
      THIS << Tokens("||") << THIS,
      THIS << Tokens(":") << THIS)
  Type = (
      Tokens("", "Int Bool") |
      Token('(') + THIS + ',' + THIS + ')' |
      Token('[') + THIS + ']')
  Stmt = (
      Token('{') + THIS * Many + '}' |
      Keyword("if") + '(' + Exp + ')' << THIS + Opt(Keyword('else') + THIS) |
      Keyword("while") + '(' + Exp + ')' + THIS |
      id + '=' + Exp + ';' |
      FunCall + ';' |
      Keyword('return') + Opt(Exp) + ';')
  FArgs = List(Type + id, ',', 1)
  RetType = Type | Keyword('void')
  VarDecl = Type + id + '=' + Exp + ';'
  FunDecl = (
      RetType + id + '(' + Opt(FArgs) + ')' +
      '{' + VarDecl * Many + Stmt * Some + '}')
  Decl = VarDecl | FunDecl
  Prog = Decl * Some
  COMMENTS = Token(re="/[*](?:[^*]|[*][^/])*[*]/") | Token(re="//[^\n\r]*")
  START = Prog

EXAMPLE = """\
Int factorial(Int n) {
  Int result = 1;
  while (n > 1) {
    result = result * n;
    n = n - 1;
  }
  return result;
}
"""

parse_tree = G.parse(EXAMPLE)
print(G.repr_parse_tree(parse_tree))

Produces this output:

(START (Prog (Decl (FunDecl
  (RetType (Type 'Int'))
  (id 'factorial') '('
  (FArgs
    (Type 'Int')
    (id 'n')) ')' '{'
  (VarDecl
    (Type 'Int')
    (id 'result') '='
    (Exp (int '1')) ';')
  (Stmt 'while' '('
    (Exp
      (Exp (id 'n')) '>'
      (Exp (int '1'))) ')'
    (Stmt '{'
      (Stmt
        (id 'result') '='
        (Exp
          (Exp (id 'result')) '*'
          (Exp (id 'n'))) ';')
      (Stmt
        (id 'n') '='
        (Exp
          (Exp (id 'n')) '-'
          (Exp (int '1'))) ';') '}'))
  (Stmt 'return'
    (Exp (id 'result')) ';') '}'))))
Russell Stuart
2

The most efficient way to build a parser is to use a dedicated tool whose whole purpose is to build parsers. They used to be called compiler compilers, but nowadays the focus has shifted (broadened) to language workbenches, which give you much more help in building your own language. For instance, almost any language workbench will provide you with IDE support and syntax highlighting for your language right off the bat, just by looking at the grammar. They also help immensely with debugging your grammar and your language (you didn't expect left recursion to be the biggest of your problems, did you?).

Among the best currently supported and actively developed language workbenches one could name:

If you are really so inclined, or if you are considering writing a parser yourself just for the amusement and experience, the best modern algorithms are SGLR, GLL and Packrat. Each of these is the quintessence of half a century of algorithmic research, so do not expect to understand them fully in a flash, and do not expect any good to come of the first couple of "fixes" you'll come up with. If you do come up with a nice improvement, though, do not hesitate to share your findings with the authors, or publish it otherwise!

grammarware
1

Thank you for all this advice, but we finally decided to build our own recursive-descent parser, using exactly the same method as described here: http://www.cs.binghamton.edu/~zdu/parsdemo/recintro.html

Indeed, we changed the grammar in order to remove the left-recursive rules, and because the grammar I showed in my first message isn't LL(1), we used our token list (produced by our scanner) to perform a lookahead that goes further when needed. It seems to work quite well.
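For readers following along, the standard left-recursion removal transformation looks like this (illustrated on a hypothetical Exp/Term pair, not the exact rules from the grammar image):

```
Exp  -> Exp '+' Term | Term        -- left recursive: an LL parser loops forever

Exp  -> Term Exp'                  -- after the transformation:
Exp' -> '+' Term Exp' | ε          -- right recursive, LL-friendly
```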

Now we have to build an AST within those recursive functions. Do you have any suggestions or tips? Thank you very much.

mentinet
  • You should not embed another question in your answer, as this is likely to get deleted as "not an answer". Ask another separate question and [edit] this one to be just an answer, which you can "accept". – Brian Tompsett - 汤莱恩 Jun 13 '15 at 18:16
0

The most efficient parsers are LR parsers, but LR parsers are a bit difficult to implement. You can go for the recursive-descent parsing technique, as it is easier to implement in C.

Santhosh Pai
  • I question all those statements. I'm not aware that LR parsing is significantly more efficient than LL. It is however more *powerful.* LR parsers are constructed by parser generators, which is easy, where some LL parsers such as recursive descent must be constructed by hand, which is considerably *more* difficult, as well as being much more error-prone. – user207421 Apr 17 '13 at 22:00
  • Moreover, we have to write the code ourselves, so we can use libraries but not a program that would generate the code for us. Then, considering the grammar, what are the things that have to be changed besides the left recursion? Any advice about it? Thank you very much. – mentinet Apr 18 '13 at 08:00
  • @EJP, LR parsers are not "more powerful", because it is really hard to get an infinite lookahead with them (and trivial with LL). – SK-logic Apr 18 '13 at 09:12
  • LR parsers are more efficient in terms of memory use, but only if there's special management for the parsing tables. LR parsers are the least efficient in terms of programmer effort. – Apalala Apr 18 '13 at 18:55
  • @SK-logic LR parsers are more powerful because they parse on right context, not left context. LR(1) > LL(1). You don't need infinite look ahead. – user207421 Apr 21 '13 at 22:53
  • @Apalala I have already disputed that assertion. That's what parser generators are for. – user207421 Apr 21 '13 at 22:54
  • @EJP, mind explaining? There are grammars that *do* need infinite lookahead and they cannot be represented in LR(n). – SK-logic Apr 21 '13 at 22:59
@EJP Parser generators don't know how to debug grammars, nor themselves. A shift-reduce conflict is something the parser generator understands very well, but the implementer does not. And that's all I have to say about that! – Apalala Apr 22 '13 at 03:36
  • @SK-logic As LR(k) is a superset of LL(k), those grammars can't be represented in LL(k) either. – user207421 Apr 24 '13 at 10:37
  • @Apalala The relevance of your comment escapes me. – user207421 Apr 24 '13 at 10:37
  • @EJP, I'm talking about LL(infinity) and its superset, PEG. – SK-logic Apr 24 '13 at 11:14