4

C is parsed strictly in order, i.e. everything has to be declared before it is used; in particular, types must be declared before variables of those types. This makes sense because the grammar would be ambiguous if you didn't know what was the name of a type and what wasn't, e.g. a * b depends on whether a names a type.

On the other hand, some C family languages have the desirable property of relaxing this restriction (thus eliminating manual juggling of header files). I'm writing a parser for a C-superset language which is intended to likewise have that restriction relaxed, so now I need to figure out how to do it.

One method that occurs to me would be to do two passes. The first pass goes through everything, taking advantage of the fact that everything at the top level must be a declaration, not a statement, and picks up all the types. At this stage function bodies are left unexamined, just picked up as token streams delimited by matching braces. The second pass parses function bodies. Local declarations within a function would have to be in order, but that's not really a problem.

Are there any stumbling blocks in that method I haven't thought of?

How do compilers for C++, Java, C# etc. typically handle it for those parts of those languages that don't require declarations in order?

rwallace
  • 31,405
  • 40
  • 123
  • 242

3 Answers3

2

You don't have to do name resolution as you parse. First, if you are designing a "C-like" language (as opposed to a new C implementation), you can define your syntax so that declarations, expressions, methods, etc. are all unambiguous in the syntax. Then parsing order doesn't matter. (This would also fix the preprocessor disease, too, by integrating the preprocessor into the language in a structured way).

If you insist on C-like syntax, you can use a parser that tolerates ambiguity, e.g., is happy to process "x*y;" and hold it as both an expression and declaration until it gets further data. In the extreme case, just think of this as a constraint-based resolution. C and C++ insisted on knowing definitions first because originally compiler memory space was pretty constrained and you couldn't just keep everything; that's no longer true. You don't have to insist on knowing the answer when you parse.

We use GLR parsers for this in our DMS Software Reengineering Toolkit, and it is perfectly happy to parse C and C++11 just fine. We do name resolution after parsing; this isolates parsing and name resolution, making a much cleaner, easier to manage front end.

Community
  • 1
  • 1
Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
1

C++ does require in-order declarations.

Keep in mind that C and C++ are a completely different ballgame. They are using positively ancient linker technology (C because it is ancient, C++ because its almost as ancient and compatible with C linkers). Both result in binaries that run directly on the CPU, with no runtime support to speak of.

Java and C# have much improved linkers to rely on, and a huge runtime support system to use.

Either has pros and cons. One of the cons of the C/C++ approach is that everything must be resolved at compile time, because at runtime the application is on its own. The pro is that everything is resolved at compile time, so at runtime the application can be left alone.

DevSolar
  • 67,862
  • 21
  • 134
  • 209
  • Actually that's a good point, C++ doesn't require functions within a class to be declared before they are called, but it does still require types to be declared before they are used. I'm planning to stick to C linker technology, but I hope that doesn't preclude making a few improvements on the parsing side. – rwallace Feb 05 '13 at 14:55
0

almost all compilers do two passes. Another way is to allow declaration of variables in the grammar itself, which would make the grammar a lot harder to parse by hand, but eliminating need for a second pass.

Aniket Inge
  • 25,375
  • 5
  • 50
  • 78
  • It is common to do multiple passes, but not within the parsing stage per se. I'm not sure how you could write the grammar to disambiguate between something that is a variable declaration and something that isn't, without knowing the types beforehand; can you be more specific about what you have in mind? – rwallace Feb 05 '13 at 15:05