19

I'm trying to write a C parser, for my own education. I know that I could use tools like YACC to simplify the process, but I want to learn as much as possible from the experience, so I'm starting from scratch.

My question is how I should handle a line like this:

doSomethingWith((foo)(bar));

It could be that (foo)(bar) is a type cast, as in:

typedef int foo;

void doSomethingWith(foo aFoo) { ... }

int main() {
    float bar = 23.6;

    doSomethingWith((foo)(bar));

    return 0;
}

Or, it could be that (foo)(bar) is a function call, as in:

int foo(int bar) { return bar; }

void doSomethingWith(int anInt) { ... }

int main() {
    int bar = 10;

    doSomethingWith((foo)(bar));

    return 0;
}

It seems to me that the parser cannot determine which of the two cases it is dealing with solely by looking at the line doSomethingWith((foo)(bar)); This annoys me, because I was hoping to be able to separate the parsing stage from the "interpretation" stage where you actually determine that the line typedef int foo; means that foo is now a valid type. In my imagined scenario, Type a = b + c * d would parse just fine, even if Type, a, b, c, and d aren't defined anywhere, and problems would only arise later, when actually trying to "resolve" the identifiers.

So, my question is: how do "real" C parsers deal with this? Is the separation between the two stages that I was hoping for just a naive wish, or am I missing something?

Ord
  • 5,693
  • 5
  • 28
  • 42
  • 5
    Hint: It doesn't. C compilers generally have a symbol table with them that contains all symbols identified so far, making it possible to do such a distinction. – fuz Sep 07 '13 at 20:03
  • I don't know your experience and skills, so I may be barking at the wrong tree, but what you are trying to do is fairly difficult (a C parser written in C). If this is your first attempt to write a parser in general, I'd suggest to try with simpler languages first (both as targets and for the implementation). I'd suggest [Lua](http://www.lua.org/), it is simple, dynamic, safe and fast. Moreover, it has been designed to be embedded into applications written in C. IIRC on its WIKI I saw a parser for Lua written in Lua, whose study may be worth for you. – LorenzoDonati4Ukraine-OnStrike Sep 07 '13 at 20:23
  • Unlike you may be thinking, C is really hard to parse... I recommend you start try a more simple language for start. Maybe a BASIC-like language is a good one. And then you add own features in the language. Was as I did it. – Jack Sep 08 '13 at 00:25
  • 1
    @Lorenzo and Jack - I realize that C is not easy to parse, and I recognize that I am undertaking a challenging project, but I'd like to try anyway. Maybe I don't understand the *complete* extent of the difficulties I'll encounter, but hey, live and learn. I have written some very trivial parsers in the past, so I'm not completely new to this. And I do have ulterior motives for choosing C - yes, I want to learn about parsers by writing one, but I also have a need for a customizable C parser in one of my other projects, so I thought I'd kill two birds with one stone. Thanks for your help! – Ord Sep 08 '13 at 17:13

3 Answers3

17

Historically, typedefs were a relatively late addition to C. Before they were added to the language, type names consisted of keywords (int, char, double, struct, etc.) and punctuation characters (*, [], ()), and so were easy to recognize unambiguously. An identifier could never be a type name, so an identifier in parentheses followed by an expression could not be a cast expression.

Typedefs made it possible for a user-defined identifier to be a type name, which rather seriously messed up the grammar.

Take a look at the syntax of type-specifier in the C standard (I'll use the C90 version since it's slightly simpler):

type-specifier:
void
char
short
int
long
float
double
signed
unsigned
struct-or-union-specifier
enum-specifier
typedef-name

All but the last can be easily recognized because they either are keywords, or start with a keyword. But a typedef-name is just an identifier.

When a C compiler processes a typedef declaration, it needs to, in effect, introduce the typedef name as a new keyword. Which means that, unlike for a language with a context-free grammar, there needs to be feedback from the symbol table to the parser.

And even that's a bit of an oversimplification. A typedef name can still be redefined, either as another typedef or as something else, in an inner scope:

{
    typedef int foo; /* foo is a typedef name */
    {
        int foo;     /* foo is now an ordinary identifier, an object name */
    }
                     /* And now foo is a typedef name again */
}

So a typedef name is effectively a user-defined keyword if it's used in a context where a type name is valid, but is still an ordinary identifier if it's redeclared.

TL;DR: Parsing C is hard.

Keith Thompson
  • 254,901
  • 44
  • 429
  • 631
13

What you're talking about is a "context-free grammar", where you can parse everything without having to remember what's a type and what's a variable (or, in general, use any semantic attributes associated with an identifier). C, unfortunately, is not context-free, so you don't have that luxury.

Sneftel
  • 40,271
  • 12
  • 71
  • 104
  • 3
    Yep. And this is a [relevant link](http://eli.thegreenplace.net/2007/11/24/the-context-sensitivity-of-cs-grammar/). – LorenzoDonati4Ukraine-OnStrike Sep 07 '13 at 20:14
  • 1
    @LorenzoDonati Thanks for the link - that page and its follow-ups (http://eli.thegreenplace.net/2011/05/02/the-context-sensitivity-of-c%E2%80%99s-grammar-revisited/ and http://eli.thegreenplace.net/2012/07/05/how-clang-handles-the-type-variable-name-ambiguity-of-cc/) really helped me to get a handle on things. – Ord Sep 08 '13 at 17:04
  • This is why you have to declare things in header files in C - just checking at link time that all the right functions and types exist will not work. – jwg Apr 23 '14 at 12:39
3

Virtually no modern language is context free (e.g, can have the meaning of a phrase determined entirely locally).

The smart money is to build a context-free parser, and resolve context-dependencies later, isolating the two tasks.

Thus the question of "how does the parser know type cast from function call" becomes a non-topic; the only reason it exists is that people insist on tangling raw parsing with name and type resolution.

For a cleaner model, consider using a GLR parser. See this SO answer for more detail, using the problem of resolving what

 x*y;

means in C, the same problem for OP, if he hasn't tripped over it yet.

Community
  • 1
  • 1
Ira Baxter
  • 93,541
  • 22
  • 172
  • 341