Parsing a C family language out of order

Question

C is parsed strictly in order, i.e. everything has to be declared before it is used; in particular, types must be declared before variables of those types. This makes sense because the grammar would be ambiguous if you didn't know what was the name of a type and what wasn't, e.g. a * b depends on whether a names a type.

On the other hand, some C family languages have the desirable property of relaxing this restriction (thus eliminating manual juggling of header files). I'm writing a parser for a C-superset language which is intended to likewise have that restriction relaxed, so now I need to figure out how to do it.

One method that occurs to me would be to do two passes. The first pass goes through everything, taking advantage of the fact that everything at the top level must be a declaration, not a statement, and picks up all the types. At this stage function bodies are left unexamined, just picked up as token streams delimited by matching braces. The second pass parses function bodies. Local declarations within a function would have to be in order, but that's not really a problem.

Are there any stumbling blocks in that method I haven't thought of?

How do compilers for C++, Java, C# etc. typically handle it for those parts of those languages that don't require declarations in order?

Yes, preprocessing will still have to be done as a separate initial step. — rwallace, Feb 05 '13 at 14:56
Well, I abandoned this method because of the ambiguity presented by `a * b`. Btw, some other things to consider include incomplete types. It may not be right to first collect all types and then apply them. What if that works, but the code doesn't actually compile because at some point a type is still incomplete? — Alexey Frunze, Feb 05 '13 at 15:53
Then an error message would be generated in the second pass. — rwallace, Feb 05 '13 at 17:01
If you preserve enough information about the order of things. — Alexey Frunze, Feb 05 '13 at 17:02

score 2 · Accepted Answer · edited May 23 '17 at 11:48

You don't have to do name resolution as you parse. First, if you are designing a "C-like" language (as opposed to a new C implementation), you can define your syntax so that declarations, expressions, methods, etc. are all unambiguous in the syntax. Then parsing order doesn't matter. (This would also fix the preprocessor disease, too, by integrating the preprocessor into the language in a structured way).

If you insist on C-like syntax, you can use a parser that tolerates ambiguity, e.g., is happy to process "x*y;" and hold it as both an expression and declaration until it gets further data. In the extreme case, just think of this as a constraint-based resolution. C and C++ insisted on knowing definitions first because originally compiler memory space was pretty constrained and you couldn't just keep everything; that's no longer true. You don't have to insist on knowing the answer when you parse.

We use GLR parsers for this in our DMS Software Reengineering Toolkit, and it is perfectly happy to parse C and C++11 just fine. We do name resolution after parsing; this isolates parsing and name resolution, making a much cleaner, easier to manage front end.

score 1 · Answer 2 · answered Feb 05 '13 at 14:49

C++ does require in-order declarations.

Keep in mind that C and C++ are a completely different ballgame. They are using positively ancient linker technology (C because it is ancient, C++ because its almost as ancient and compatible with C linkers). Both result in binaries that run directly on the CPU, with no runtime support to speak of.

Java and C# have much improved linkers to rely on, and a huge runtime support system to use.

Either has pros and cons. One of the cons of the C/C++ approach is that everything must be resolved at compile time, because at runtime the application is on its own. The pro is that everything is resolved at compile time, so at runtime the application can be left alone.

Actually that's a good point, C++ doesn't require functions within a class to be declared before they are called, but it does still require types to be declared before they are used. I'm planning to stick to C linker technology, but I hope that doesn't preclude making a few improvements on the parsing side. — rwallace, Feb 05 '13 at 14:55

score 0 · Answer 3 · answered Feb 05 '13 at 14:47

0

almost all compilers do two passes. Another way is to allow declaration of variables in the grammar itself, which would make the grammar a lot harder to parse by hand, but eliminating need for a second pass.

answered Feb 05 '13 at 14:47

Aniket Inge

25,375
5
50
78

It is common to do multiple passes, but not within the parsing stage per se. I'm not sure how you could write the grammar to disambiguate between something that is a variable declaration and something that isn't, without knowing the types beforehand; can you be more specific about what you have in mind? – rwallace Feb 05 '13 at 15:05

Parsing a C family language out of order

3 Answers3