6

I've been recently writing parser for language based on C. I'm using CUP (Yacc for Java).

I want to implement "The lexer hack" (http://eli.thegreenplace.net/2011/05/02/the-context-sensitivity-of-c%E2%80%99s-grammar-revisited/ or https://en.wikipedia.org/wiki/The_lexer_hack), to distinguish typedef names and variable/function names etc. To enable declaring variables of the same name as type declared earlier (example from first link):

typedef int AA;

void foo() {
    AA aa;       /* OK - define variable aa of type AA */
    float AA;    /* OK - define variable AA of type float */
}

we have to introduce some new productions, where variable/function name could be either IDENTIFIER or TYPENAME. And this is the moment where difficulties occur - conflicts in grammar.

I was trying not to use this messy Yacc grammar for gcc 3.4 (http://yaxx.googlecode.com/svn-history/r2/trunk/gcc-3.4.0/gcc/c-parse.y), but this time I have no idea how to resolve conflicts on my own. I took a look at Yacc grammar:

declarator:
    after_type_declarator
    | notype_declarator
    ;

after_type_declarator:
    ...
    | TYPENAME
    ;

notype_declarator:
    ...
    | IDENTIFIER
    ;

fndef:
    declspecs_ts setspecs declarator
    // some action code
    // the rest of production
...

setspecs: /* empty */
    // some action code

declspecs_ts means declaration_specifiers where "Whether a type specifier has been seen; after a type specifier, a typedef name is an identifier to redeclare (_ts or _nots)."

From declspecs_ts we can reach

typespec_nonreserved_nonattr:
    TYPENAME
    ...
    ;

At the first glance I can't believe how shift/reduce conflicts does not appear! setspecs is empty, so we have declspecs_ts followed by declarator, so that we can expect that parser should be confused whether TYPENAME is from declspecs_ts or from declarator.

Can anyone explain this briefly (or even precisely). Thanks in advance!

EDIT: Useful link: http://www.gnu.org/software/bison/manual/bison.html#Semantic-Tokens

Grzes
  • 971
  • 1
  • 13
  • 28

1 Answers1

2

I can't speak for the specific code.

But the basic trick is that the C lexer inspects every IDENTIFIER, and decides if might be the name of a typedef. If so, then it changes the lexeme type to TYPEDEF and hands it to the parser.

How is the lexer to know what identifiers are typedefs? The parser must in effect tell it, by capturing typedef information as it runs. Somewhere in the grammar related to declarations, there must be an action to provide this information. I would have expected it to be attached to the grammar rules for, well, typedef declarations.

You didn't show what "setspec" did; maybe that's the place. A common trick used with LR parser generators is to introduce a grammar rule E with an empty right hand (your example "setspec"?), to be invoked in the middle of some other grammar rule (your example "fndef") just to enable access to a semantic action in the middle of processing that rule.

This whole trick is to get around parsing ambiguity if you can't tell typedefs from other identifiers. If your parser tolerates ambiguity, you don't need this hack at all; just parse, and built ASTs with both (sub)parses. After you acquire the AST, a tree walk can find type information and eliminate inconsistent subparses. We do this with GLR for both C and C++, and it beautifully separates parsing from name resolution.

Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
  • You're right, but in this case something different happens. You can read `setspec` definition in link above code snippet. I looked a bit deeper in this code. `declspecs_ts` contains EXACTLY ONE `TYPENAME` and often some other specifiers (qualifiers like `INLINE` etc.). There are also other variants like `declspecs_nots` and we have to combine different `declspecs` with `after_type_declarator` and `notype_declaraor` to support all possible combinations. This is even more complicated (because of attribues), nevertheless it is organized/split smart enough to prevent conflicts. – Grzes Jun 20 '13 at 10:09
  • 1
    Welcome to production grammars, which have not only the "standard" langauge constructs, but all the other junk that a specific compiler/dialect (e.g., GCC, MS, GreenHills) throws into the pile. When you add all the extensions that people do, the grammars tend to stop being simple. If you are lucky, the grammar is still "organized/split smart enough" to be manageable. GLR parsers actually make doing this for big complicated grammars quite a bit easier, because don't have to entangle stuff like TYPEDEF checking the grammar itself. – Ira Baxter Jun 20 '13 at 10:13
  • This is how I see it: `declspecs_ts after_type_declaration` is for case when typename is redeclared as variable (possibly of some other typename2), for example: `typedef int myType1; typedef float myType2; { myType1 myType2; /* variable myType2 (name) of type myType1 */ }` `declspecs_nots notype_declaration` is for simple `int foo;` – Grzes Jun 20 '13 at 10:15
  • Ok, can you recommend any GLR parser for Java? – Grzes Jun 20 '13 at 10:16
  • 1
    Consider http://strategoxt.org/Stratego/JSGLR I have no specific experience with this, but the Stratego guys are a pretty good act. Wouldn't surprise me if you found a C grammar nearby. – Ira Baxter Jun 20 '13 at 10:17