
I'm trying to learn how to write a compiler, so I have been reading a lot about context-free languages. There are still some things I can't figure out on my own.

Since this is my first compiler, there are some practices I'm not aware of. My questions are asked with building a parser generator in mind, not a compiler or a lexer. Some of them may be obvious.

Among my readings are: Bottom-Up Parsing, Top-Down Parsing, and Formal Grammars. The picture shown comes from Miscellaneous Parsing. All of them come from the Stanford CS143 class.

[Image: Parsers / Grammars Hierarchy]

Here are my questions:

0) How do (ambiguous / unambiguous) and (left-recursive / right-recursive) influence the need for one algorithm or another? Are there other ways to classify a grammar?

1) An ambiguous grammar is one that has several parse trees for some string. But shouldn't the choice of a leftmost or rightmost derivation make the parse tree unique?

[EDIT: Answered here ]

2.1) But still, is the ambiguity of a grammar related to k? I mean, given an LR(2) grammar, is it ambiguous for an LR(1) parser and unambiguous for an LR(2) one?

[EDIT: No, it's not. An LR(2) grammar means that the parser needs two tokens of lookahead to choose the right rule to use. An ambiguous grammar, on the other hand, is one that yields several parse trees for some input. ]

2.2) So an LR(*) parser, as far as you can imagine one, would have no ambiguous grammars at all and could then parse the entire set of context-free languages?

[EDIT: Answered by Ira Baxter, LR(*) is less powerful than GLR, in that it can't handle multiple parse trees. ]

3) Depending on the previous answers, what follows may be self-contradictory. Considering LR parsing, do ambiguous grammars trigger shift-reduce conflicts? Can an unambiguous grammar trigger one too? Likewise, what about reduce-reduce conflicts?

[EDIT: That's it: ambiguous grammars lead to shift-reduce and reduce-reduce conflicts. By contraposition, if there are no conflicts, the grammar is unambiguous. ]

4) The ability to parse left-recursive grammars is an advantage of LR(k) parsers over LL(k) parsers. Is it the only difference between them?

[EDIT: yes. ]

5) Given G1:

G1 :
S -> S + S
S -> S - S
S -> a

5.1) G1 is left-recursive, right-recursive, and ambiguous, am I right? Is it an LR(2) grammar? One could make it unambiguous:

G2 :
S -> S + a
S -> S - a
S -> a

5.2) Is G2 still ambiguous? Does a parser for G2 need two tokens of lookahead? By factoring, we get:

G3 :
S -> S V
V -> + a
V -> - a
S -> a

5.3) Now, does a parser for G3 need only one token of lookahead? What are the drawbacks of these transformations? Is LR(1) the minimal parser required?

5.4) G1 is left-recursive. In order to parse it with an LL parser, one needs to transform it into a right-recursive grammar:

G4 :
S -> a + S
S -> a - S
S -> a

then

G5 :
S -> a V
V -> - V
V -> + V
V -> a

5.5) Does G4 need at least an LL(2) parser? Only G5 can be parsed by an LL(1) parser, G1-G5 all define the same language, and this language is ( a (+/- a)^n ). Is that true?
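
To make the lookahead question concrete, here is a minimal Python sketch of a predictive, LL(1)-style recursive-descent parser. It does not use G4 or G5 as written: it assumes a left-factored variant of G4, namely S -> a R, R -> + S | - S | (empty), which is a rewriting introduced here purely for illustration. The function names `parse`, `parse_s` and `parse_r` are hypothetical as well.

```python
def parse_s(tokens, pos=0):
    """S -> a R. Returns (tree, next_position) or raises SyntaxError."""
    if pos >= len(tokens) or tokens[pos] != "a":
        raise SyntaxError(f"expected 'a' at position {pos}")
    return parse_r("a", tokens, pos + 1)


def parse_r(left, tokens, pos):
    """R -> + S | - S | (empty), chosen by looking at one token only."""
    if pos < len(tokens) and tokens[pos] in ("+", "-"):
        operator = tokens[pos]
        right, end = parse_s(tokens, pos + 1)
        return (operator, left, right), end
    # Empty alternative: consume nothing and return what we have so far.
    return left, pos


def parse(text):
    """Parse a whitespace-separated string such as 'a + a - a'."""
    tokens = text.split()
    tree, end = parse_s(tokens)
    if end != len(tokens):
        raise SyntaxError(f"unexpected token at position {end}")
    return tree


print(parse("a + a - a"))
```

With G4 as written, all three alternatives for S start with `a`, so a single token is not enough to pick a production; factoring out the common `a` is what lets `parse_r` decide from one token. Note also that the tree comes out right-nested, grouping the input as a + (a - a), which is a direct consequence of using a right-recursive grammar.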

5.6) For each grammar G1 to G5, what is the smallest class to which it belongs?

6) Finally, since many different grammars may define the same language, how does one choose the grammar and the associated parser? Is the resulting parse tree important? What is the influence of the parse tree?

I'm asking a lot, and I don't really expect a complete answer, but any help would be very much appreciated.

Thanks for reading!
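
One way to check the ambiguity claims about G1 and G2 experimentally is to count the parse trees of a sample string. The Python sketch below is a brute-force counter that assumes the grammar has no empty productions and no unit productions between nonterminals (true for G1 and G2). The dictionary encoding of the grammars and the name `count_trees` are choices made for this illustration, not taken from any parser generator.

```python
from functools import lru_cache

# Grammars as {nonterminal: [right-hand sides]}; anything that is not a key is a terminal.
G1 = {"S": [("S", "+", "S"), ("S", "-", "S"), ("a",)]}
G2 = {"S": [("S", "+", "a"), ("S", "-", "a"), ("a",)]}


def count_trees(grammar, start, tokens):
    """Count the distinct parse trees of `tokens` rooted at `start`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count_symbol(symbol, i, j):
        # Number of ways `symbol` derives tokens[i:j].
        if symbol not in grammar:  # terminal: must match exactly one token
            return 1 if j == i + 1 and tokens[i] == symbol else 0
        return sum(count_sequence(rhs, i, j) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def count_sequence(rhs, i, j):
        # Number of ways the symbol sequence `rhs` derives tokens[i:j].
        if len(rhs) == 1:
            return count_symbol(rhs[0], i, j)
        total = 0
        # The first symbol takes tokens[i:k]; every remaining symbol needs
        # at least one token, which also keeps left recursion finite.
        for k in range(i + 1, j - len(rhs) + 2):
            first = count_symbol(rhs[0], i, k)
            if first:
                total += first * count_sequence(rhs[1:], k, j)
        return total

    return count_symbol(start, 0, len(tokens))


sample = "a + a - a".split()
print(count_trees(G1, "S", sample))  # 2
print(count_trees(G2, "S", sample))  # 1
```

A count above 1 for some string shows the grammar is ambiguous. A count of 1 on a few samples is only evidence, not a proof, that it is unambiguous.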

dader

1 Answer


"Many grammars may define the same language, how does one choose..."?

Usually, you choose the one that meets the following criteria:

  • conceptually as simple as you can make it (implication: smaller than others)
  • tracks the terminology in the language reference manual where possible
  • least amount of bending to meet the constraints of your parser generator

That last one can make a mess of your conceptual simplicity, and your chart of various parser styles shows the number of different issues that you face depending on your choice of generator. This is aggravated by the fact that the choice of generator is often made well before you actually choose the grammar.

One way to minimize grammar bending is to choose a parser generator which handles fully context-free grammars. GLR parsing has this very significant advantage. I've been using it for 15 years and have done dozens of real languages with it.

Ira Baxter
  • Thanks. So with GLR, one would be able to parse any CFG with a grammar as simple as it can be, giving a similarly simple parse tree. Then a question arises: does GLR = LR(*)? Moreover, with a GLR parser there is no need to bend the grammar, right? – dader Jun 17 '11 at 14:21
  • Technically yes. There are CFGs which cause GLR to have exponential behavior, and you thus still have to bend some. As a general rule, this behavior is pretty rare. You will find as you build parsers that sometimes you want to add semantic constraints that are outside what CFG can do (consider matching multiple Fortran DO loop heads to the same CONTINUE statement by matching the line number), and so you'll still have to bend the grammar some. But ultimately, you bend the grammar a LOT less with GLR. Yes, GLR has "infinite lookahead", it can do anything LR(*) can do. – Ira Baxter Jun 17 '11 at 14:28
  • OK for GLR doing whatever LR(*) can do, but I meant the opposite: does LR(*) handle the full set of CFGs as GLR does? I'm asking because the answer will settle point 2: is the set of LR(*) grammars equal to (includes and is included in) the set of all CFGs? – dader Jun 17 '11 at 14:44
  • LR(*) does not handle ambiguous grammars. Many real languages (including C++, see http://stackoverflow.com/questions/243383/why-c-cannot-be-parsed-with-a-lr1-parser/1004737#1004737) are ambiguous if you stick to only the grammar. For such languages, it really is often easier to parse them, collect the ambiguities, and eliminate those ambiguities when the parse phase is complete. Otherwise you end up with something like the classic C/C++ compiler hack in which symbol table (context sensitive) data is fed into the lexer/parser, making a real mess. – Ira Baxter Jun 17 '11 at 15:12
  • OK, I have a better understanding of this part now, thanks a lot Ira Baxter! – dader Jun 17 '11 at 15:42
  • Considering the C++ example x * y, the ambiguity is resolved thanks to contextual information from the symbol table. But if it were parsed by an LR parser, what kind of error or conflict would arise in this specific case? And which grammar rules would trigger it? – dader Jun 18 '11 at 02:56
  • @dader51: Grammar rules triggering it: "stmt = exp ;" (for x*y as variable x multiplied by variable y) and "stmt = declaration;" (for x as a type, x* as pointer to an x-type, and y as a variable being declared as an x*-type). LR parser issue: the grammar rules for exp and those for declaration will both claim they see valid strings of tokens " x * y ; ". At that point, the parser has a reduce-reduce conflict somewhere. This means the LR parser can't decide which production to reduce by, which is what happens when you have ambiguity in the grammar. – Ira Baxter Jun 18 '11 at 04:04
  • Just a small note for anyone who might end up here: [the BRNGLR algorithm](http://link.springer.com/article/10.1007%2Fs00236-007-0054-z), a variant of the GLR algorithm written by Elizabeth Scott, fixes the exponential behavior, being worst-case cubic for any context-free grammar. – paulotorrens Jun 12 '16 at 07:52
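
The "parse first, resolve the ambiguities afterwards" approach described in the comments above for `x * y ;` can be sketched in Python roughly as follows. This is a toy, not a GLR parser: it only recognizes the four-token pattern `name * name ;`, and the tuple shapes and the names `interpretations`, `disambiguate` and `type_names` are all hypothetical, invented for this illustration.

```python
def interpretations(tokens):
    """Return every reading of a `name * name ;` statement.

    Both the expression rule and the declaration rule accept the same
    tokens, so both readings are kept instead of forcing a choice.
    """
    if len(tokens) == 4 and tokens[1] == "*" and tokens[3] == ";":
        left, right = tokens[0], tokens[2]
        return [
            ("expr_stmt", ("mul", left, right)),           # x multiplied by y
            ("declaration", ("pointer_to", left), right),  # y declared as x*
        ]
    return []


def disambiguate(readings, type_names):
    """Drop the readings that contradict the symbol table.

    `type_names` stands in for a real symbol table: the set of
    identifiers known to name types at this point in the program.
    """
    kept = []
    for reading in readings:
        first_name = reading[1][1]  # the identifier before the `*`
        is_declaration = reading[0] == "declaration"
        if is_declaration == (first_name in type_names):
            kept.append(reading)
    return kept


readings = interpretations("x * y ;".split())
print(disambiguate(readings, {"x"}))   # only the declaration survives
print(disambiguate(readings, set()))   # only the multiplication survives
```

The point of the split is that the parsing step stays purely context-free, and the context-sensitive knowledge (is `x` a type?) is applied in a separate pass instead of being fed back into the lexer or parser.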