1

I'm writing a recursive descent parser for C in C++. I don't know how to choose the right production in the following case:

statement: labeled-statement | compound-statement | expression-statement | selection-statement | iteration-statement | jump-statement

I read about the "first"-set which compares the lookahead-token/char with possible terminals which comes first in the productions. Currently I'm stuck at using a first-set in a recursive descent parser, because I only have a function and nothing else, no object for each rule or anything else with which I can identify a rule/production.

Lightness Races in Orbit
  • 378,754
  • 76
  • 643
  • 1,055
dcast
  • 111
  • 7

1 Answers1

2

Your grammar is invalid for recursive descent parsers because it's ambiguous on the left side:

  • labeled-statement starts with an identifier
  • compound-statement starts with a { (this is fine)
  • expression-statement starts with an identifier or a number (or ()

Can stop here, you have a clash between labeled statement and expression statement. You need to transform your grammer to get rid of left-side ambiguities (through temporary grammar nodes to contain the common parts so when you branch you can determine which branch to go to using only the look-ahead).

Blindy
  • 65,249
  • 10
  • 91
  • 131
  • mhh sounds good, but what about trying one production after another and excluding these, which doesn't match with its "first"-set (for increased performance). if any production fails, i can backtrack to the last production or last valid state? btw: after i eliminated the left-side ambiguities, i still don't know which production to choose, because i still have no first-set to compare with the lookahead. are there also any good solutions? – dcast Sep 15 '11 at 16:24
  • A backtracking compiler will never work because of its insanely low performance for even small input files. Assuming you really did remove the left-side ambiguities, building the first set is easy (if tedious): just go down the productions for every branch and build a set of the first letters possible. For example, `statement`'s first set is `identifier` (from the expressions and labels), `{` (from the compound statement), `number` (expressions), `goto`, `for`, `if` etc. In your recursive function you just check the lookahead against each branch's first set. – Blindy Sep 15 '11 at 16:43
  • In case it's not obvious, having *any* overlap at all between branches of a single production means your grammar is ambiguous. Also epsilon productions have no place in a recursive descent parser's grammar, you have to get rid of them if you have any. – Blindy Sep 15 '11 at 16:44
  • Fine :), how can I implement a first-set in c++ code? Where to store the different first-sets and so on? How can I prevent from empty productions, because they are generated by eliminating left-recursions :(. – dcast Sep 15 '11 at 17:28
  • They're implemented as simply an array of token types, and you store them as local variables in the recursive functions. And to remove epsilon productions, just google for a nice algorithm, I don't have my dragon book with me and it's been a good 10 years since I last implemented a compiler. – Blindy Sep 15 '11 at 17:43
  • Very big thanks :). I'll try it tomorrow. Which language did your compiler implement? – dcast Sep 15 '11 at 17:52
  • I last implemented a subset of C++ complete with recursion and subclassing for my end of the semester project. – Blindy Sep 15 '11 at 17:57