
I'm working on a compiler for a C++-like language (assume that we'll be compiling C++ for now, and do not consider The Lexer Hack). So far the stages from stream to lexer are narrow: the parser calls getToken, which may call getCharacter. This contrasts with the "broad" technique, which would load the entire stream into memory and convert it entirely into tokens before moving on to parsing.

Now that I have a narrow lexer and stream, I'm wondering whether it's possible to create a narrow parser as well, specifically a re-entrant LL(1) parser. In any case, what is the smallest unit that goes from the parser to the semantic analyzer (a function syntax tree? an entire file? a single statement tree?)? The parser currently outputs a parse tree. Should it output something different?

To make it clearer: Lexer -> Parser -> Semantic Analysis

The lexer sends tokens one by one to the parser, and the parser parses them. Whenever the parser requests a token, the lexer provides it. Now I want to try the same for the semantic analyzer. Imagine the semantic analyzer calling getTree(): this causes the parser to parse just enough that an analyzable tree is generated for semantic analysis. The question is how to determine the minimal tree required for successful semantic analysis.
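As a rough sketch, this is the pull-based shape I have in mind; only getCharacter, getToken, and getTree come from the description above, and everything else (the whitespace tokenizing, the ';'-terminated "statements") is just a placeholder:

```cpp
// Narrow (pull-based) pipeline sketch: each stage asks the previous stage
// for its next unit of work only when it needs one.
#include <cctype>
#include <cstdio>
#include <iostream>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

class Stream {
public:
    explicit Stream(std::istream& in) : in_(in) {}
    int getCharacter() { return in_.get(); }          // one character at a time
private:
    std::istream& in_;
};

class Lexer {
public:
    explicit Lexer(Stream& s) : stream_(s) {}
    std::string getToken() {                          // pulls characters on demand
        std::string tok;
        int c;
        while ((c = stream_.getCharacter()) != EOF && std::isspace(c)) {}
        for (; c != EOF && !std::isspace(c); c = stream_.getCharacter())
            tok.push_back(static_cast<char>(c));
        return tok;                                   // empty string means end of input
    }
private:
    Stream& stream_;
};

struct Tree { std::vector<std::string> tokens; };     // stand-in for a statement tree

class Parser {
public:
    explicit Parser(Lexer& l) : lexer_(l) {}
    std::unique_ptr<Tree> getTree() {                 // pulls tokens until one "statement" ends
        auto tree = std::make_unique<Tree>();
        for (std::string tok = lexer_.getToken(); !tok.empty(); tok = lexer_.getToken()) {
            tree->tokens.push_back(tok);
            if (tok == ";") break;                    // pretend ';' closes a statement
        }
        if (tree->tokens.empty()) return nullptr;     // end of input
        return tree;
    }
private:
    Lexer& lexer_;
};

int main() {
    std::istringstream source("int x ; x = 1 ;");
    Stream stream(source);
    Lexer lexer(stream);
    Parser parser(lexer);

    // "Semantic analyzer": pulls one analyzable tree at a time from the parser.
    while (auto tree = parser.getTree())
        std::cout << "got a tree with " << tree->tokens.size() << " tokens\n";
}
```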

Thinking about it: maybe I am asking for a re-entrant semantic analyzer.

Ultimate Hawk
  • I'm pretty sure that the 'Gold parser' system also works on a stream... Not sure if I understand the question. – atlaste May 20 '15 at 06:12
  • @atlaste I'm attempting to pipe output from the parser to the semantic analyzer without parsing the entire program first. I want to attempt getting information from the parser as soon as it is available for presentation to the semantic analyzer. – Ultimate Hawk May 20 '15 at 06:15
  • Nobody loads the entire stream into memory and converts it all to tokens before proceeding to parsing. – user207421 May 20 '15 at 06:59
  • @EJP, no, but it is a recognized method that falls under "broad" compilation. It was merely to exemplify what I meant. – Ultimate Hawk May 20 '15 at 07:00
  • PS: You don't *need* the lexer hack. See http://stackoverflow.com/questions/243383/why-cant-c-be-parsed-with-a-lr1-parser/1004737#1004737 – Ira Baxter May 20 '15 at 07:24

1 Answer


If you make all the datatypes in your compiler (lexemes, AST nodes, symbol tables, symbol table entries, control flow graph nodes, generated triples) into futures, you will by definition get information from each stage of the compiler to the next "as early as possible".
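As a minimal sketch of that idea (the FunctionAst type and the two-thread setup are purely illustrative, not taken from a real compiler): the producer fulfils a promise once a node exists, and the consumer blocks on the matching future, so it can never observe a half-built object.

```cpp
// Sketch: the "parser" publishes an AST node through a std::promise and the
// "semantic analyzer" blocks on the matching std::future until it is ready.
#include <future>
#include <iostream>
#include <string>
#include <thread>

struct FunctionAst {
    std::string name;
    // ... parameters, body, etc.
};

int main() {
    std::promise<FunctionAst> slot;
    std::future<FunctionAst> pending = slot.get_future();

    // "Parser" thread: builds the node and fulfils the promise when it's done.
    std::thread parser([&slot] {
        FunctionAst fn{"main"};
        // ... consume tokens, build the subtree ...
        slot.set_value(std::move(fn));
    });

    // "Semantic analyzer": get() blocks until the parser has delivered the node.
    FunctionAst fn = pending.get();
    std::cout << "analyzing function " << fn.name << "\n";

    parser.join();
}
```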

You might even get some parallelism, if your language is multithreaded or you can fake it. It isn't clear how much parallelism you might get, because most of these objects are small and the computation spent on each one might not be enough to overcome the overhead of all that scheduling.

(We do this for a Java parser, making symbol spaces into futures. Those are pretty big, and so the synchronization costs get well amortized. When reading several thousand Java source files this makes quite a difference).

If you don't get much parallelism, then you'll probably lose in terms of overall performance because of the overhead. Generally, when you want efficiency, you want to batch things up to avoid synchronization in "getting the next thing".
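For instance, a minimal sketch of what "batching things up" could look like, with whitespace splitting standing in for real lexing: the lexer fills a whole buffer per request, so the per-token hand-off cost is amortized over the batch.

```cpp
// Batching sketch: one hand-off per buffer of tokens instead of one per token.
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

class BatchingLexer {
public:
    explicit BatchingLexer(std::istream& in) : in_(in) {}

    // Appends up to maxTokens tokens to 'out'; returns false once input is exhausted.
    bool getTokens(std::vector<std::string>& out, std::size_t maxTokens) {
        std::string tok;
        for (std::size_t i = 0; i < maxTokens; ++i) {
            if (!(in_ >> tok)) return false;
            out.push_back(tok);
        }
        return true;
    }
private:
    std::istream& in_;
};

int main() {
    std::istringstream source("int x = 1 ; return x ;");
    BatchingLexer lexer(source);

    std::vector<std::string> batch;
    while (lexer.getTokens(batch, 4)) {               // one hand-off per 4 tokens
        std::cout << "batch of " << batch.size() << " tokens\n";
        batch.clear();
    }
    std::cout << "final batch of " << batch.size() << " tokens\n";
}
```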

Ira Baxter
  • Let's focus on the AST here. Does this imply that the semantic analyzer can validate the AST gradually (in small pieces)? In that case, how big are the pieces of the AST? That's what I'm getting at. The idea of using futures looks promising (pun intended), but it's not quite what I have in mind. – Ultimate Hawk May 20 '15 at 06:57
  • Actually, it's exactly what you have in mind; we're just arguing about the size of the pieces and how you implement the futures bit. First, no matter what size granularity you pick (variable, expression, statement, block), you'll still have to manage the parser handing off that size fragment when it gets it, and the "semantic analyzer" getting control and doing its thing, before handing control back to the parser. That's what a future would do anyway. ... – Ira Baxter May 20 '15 at 07:12
  • ... What you are going to discover is that there isn't any really "neat" cutting point where the semantic analyzer has to look only inside what you hand it. And where it has to reach outside the tree you hand it, you had better have a strategy to prevent it from touching something that doesn't exist yet (C++ famously allows a class member function to refer to a class member variable that hasn't yet been declared). Even if you hand off entire functions, they *still* call each other. Going all futures all the time ensures that nobody accesses something before it is available :-} – Ira Baxter May 20 '15 at 07:14
  • This is one of the reasons people tend to build whole ASTs when they can: suddenly, nobody accesses something before it is available :-} People obviously build compilers that hand off chunks like functions, and just pay the organizational price. – Ira Baxter May 20 '15 at 07:17
  • That was what I've been fearing. The semantic analyzer seems to require the entire AST (or just a single function node plus a symbol table) to check semantics (but it has to somehow block on undiscovered identifiers). I wonder how exactly you use futures with your semantic analyzer. It seems more like a call from the semantic analyzer to the parser to get the entire AST, or is that wrong? – Ultimate Hawk May 20 '15 at 07:24
  • It's easy enough to block on undiscovered identifiers: you put a dummy id into the first symbol table where you think it might be, if that symbol table isn't "complete" yet (e.g., it is the symbol table for the class), and then you set processing of that tree aside. Eventually, the symbol gets defined (you can actively continue processing the tree then, or put it off until the scope closes). If the scope closes and the symbol isn't defined, then you push it out to the *next* scope and repeat. It's just that you have to be careful to do this (see the sketch after these comments). ... – Ira Baxter May 20 '15 at 07:29
  • The bad news is you may have to keep around a bunch of partially processed trees (it might be easier to simply not process them at all until the scope completes); the good news is that you don't often have to keep a lot of them around. (Imagine poor GCC compiling a 2 million line file.) People tend to like function boundaries, because they are more or less self-contained if you are going to compile code in chunks with simple interfaces. As your compiler gets more sophisticated, life gets more complicated; if you are going to do interprocedural optimization, when do you pitch the current tree? – Ira Baxter May 20 '15 at 07:32
  • (If your compiler generates code with triples, the problem just moves. When do you get rid of a bunch of triples representing a function, if you are doing interprocedural optimization?) The conceptually cleanest answer is "when there is no dependency on it". So an ideal answer would build a dependency list. – Ira Baxter May 20 '15 at 07:33
  • Are you referring to generating 3-address IR? I'm not focusing on optimization as of yet, but it may be something to take into consideration. As it stands, it appears pragmatic to generate full function trees that are partially analyzed and put on hold when an unknown symbol is discovered. That seems easy to parallelize as well. – Ultimate Hawk May 20 '15 at 07:38
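A minimal sketch of the deferred-symbol bookkeeping described in the comments above, assuming illustrative names throughout (Scope, PendingSymbol, and the resume-on-close policy are not taken from any actual compiler): an open scope records a placeholder entry for each identifier that is used before it is declared, along with the trees whose analysis is waiting on it; when the scope closes, still-undefined symbols are pushed out to the enclosing scope.

```cpp
// Deferring analysis on undeclared identifiers: dummy symbol entries hold the
// trees that are waiting; resolution happens when the scope closes.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Tree { std::string description; };            // stand-in for an AST fragment

struct PendingSymbol {
    bool defined = false;
    std::vector<const Tree*> waiters;                 // trees blocked on this symbol
};

class Scope {
public:
    explicit Scope(Scope* parent = nullptr) : parent_(parent) {}

    // Called when the parser sees a declaration.
    void define(const std::string& name) { symbols_[name].defined = true; }

    // Called when analysis hits an identifier; returns false if analysis of
    // 'user' has to be set aside until the symbol shows up.
    bool lookupOrDefer(const std::string& name, const Tree& user) {
        if (symbols_[name].defined) return true;
        symbols_[name].waiters.push_back(&user);      // dummy entry: known but undefined
        return false;
    }

    // Called when the scope closes: undefined symbols move to the parent scope.
    void close() {
        for (auto& [name, sym] : symbols_) {
            if (sym.defined) {
                for (const Tree* t : sym.waiters)
                    std::cout << "resuming analysis of " << t->description << "\n";
            } else if (parent_) {
                for (const Tree* t : sym.waiters)
                    parent_->symbols_[name].waiters.push_back(t);
            } else {
                std::cout << "error: undefined symbol " << name << "\n";
            }
        }
    }
private:
    Scope* parent_;
    std::map<std::string, PendingSymbol> symbols_;
};

int main() {
    Scope classScope;                                 // e.g. a C++ class body
    Tree method{"method body using member 'x'"};

    // The member function refers to 'x' before the member is declared.
    if (!classScope.lookupOrDefer("x", method))
        std::cout << "setting aside: " << method.description << "\n";

    classScope.define("x");                           // declaration shows up later
    classScope.close();                               // resumes the deferred tree
}
```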