
I'm writing a C compiler in JavaScript, purely for enrichment (as yet I expect no practical use for it, and I will likely not maintain it).

I wrote a lexer that, given a list of regular expressions and the token type matched by each expression, can successfully tokenize any string.
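For concreteness, here is roughly the shape of such a lexer (a minimal sketch, not my actual code; the pattern list is abbreviated and the token shape is just one possibility):

var patterns = [
    { type: "id",   re: /^[A-Za-z_]\w*/ },
    { type: "num",  re: /^\d+/ },
    { type: "punc", re: /^(\+\+|--|[(){};])/ },  // literal text becomes the type
    { type: "ws",   re: /^\s+/ }                 // matched but skipped below
];

function tokenize(src) {
    var tokens = [];
    while (src.length > 0) {
        var matched = false;
        for (var i = 0; i < patterns.length; i++) {
            var m = patterns[i].re.exec(src);
            if (m) {
                if (patterns[i].type === "punc")
                    tokens.push({ type: m[0], value: m[0] });  // "(", "++", ...
                else if (patterns[i].type !== "ws")
                    tokens.push({ type: patterns[i].type, value: m[0] });
                src = src.slice(m[0].length);
                matched = true;
                break;
            }
        }
        if (!matched) throw new Error("cannot tokenize: " + src);
    }
    return tokens;
}

// tokenize("++i") => [{ type: "++", value: "++" }, { type: "id", value: "i" }]

Giving punctuation tokens their literal text as their type means they line up directly with the terminal symbols ("(", "++", etc.) in the grammar below.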

I have been able to successfully tokenize C source code (somewhat reduced C, to be fair; I need to add more tokenizer patterns to capture everything). I now want to construct ASTs as an intermediate form between the source language and the translated assembly.

To do this, I am trying to implement a function that uses a context-free grammar, defined as an object with

  • key → target (expression, function declaration, cast expression, etc), and
  • value → array of mappings
    • (where a mapping is an array [order is important] of the symbols that make up the target).

Here is a sample CFG one might feed the parser (this is an adapted excerpt from this C grammar):

var cfg = {
    "cast_exp": [
        ["unary_exp"],
        ["(", "type_name", ")", "cast_exp"]
    ],   
    "unary_exp": [
        ["primary_exp"],
        ["++", "unary_exp"],
        ["--", "unary_exp"]
    ],
    "primary_exp": [
        ["id"]
    ]
};

id is one of the types my tokenizer picks up, so I suppose we can consider "primary_exp" a start symbol.

Now, my thought is to do this recursively; that is, pick up the first token and match it against one of the starting symbols. Recurse on the remaining tokens, sending across the target we matched in the previous call, and see what production rule is composed of the target we just matched.

This isn't making much sense to me, and the way I see it, I will get lost in infinite recursion (or hit a stack overflow on really long source files).

How do I write a function that can walk my array of tokens and, using the CFG described above, construct ASTs? Since I'm doing this for enrichment and as a personal challenge, if you'd like you can provide code, but I'm looking more for guidance and a broad-sense description of such an algorithm.

Purag

1 Answer


You can implement an Earley parser. (The Wikipedia article has working code and a fully worked example, so I'll only sketch the pieces here.)

Such a parser transitions from state to state as it consumes tokens. In each state, it holds a set of "items":

 {  I1  I2 ...  In }

Each individual item Ik is a rule and how much of that rule has been processed (a place called "dot").

For a rule

  R = A B C D; 

where A and B have been seen, an item for R is conceptually the same rule with a dot mark:

  R = A B <dot> C D ; 

with the dot indicating that A and B have been seen, and C needs to be found. The state/item set (might) look like:

 {  P = Q <dot> R S ;
    R = A B <dot> C D ;
    C = <dot> X Y ;
 }

Each item Ik represents a possible interpretation of the input seen so far; the reason there are multiple items is that the input may have multiple interpretations valid to the current point in the input stream. Processing tokens will change the state/this set of items.
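In code, an item can be a very small object. Here is one hypothetical shape (the origin field records the state in which the item was created, which matters below when rules finish):

// The item  R = A B <dot> C D ;  created in state 0:
var item = {
    lhs: "R",                    // the rule's left-hand side
    rhs: ["A", "B", "C", "D"],   // the rule's right-hand side
    dot: 2,                      // A and B have been seen
    origin: 0                    // index of the state where this item started
};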

For our example rule R, when a C is found by the parser (either as a token in the input stream, or because some other rule reduces and produces its left-hand side as a nonterminal), the dot is moved:

 R = A B C <dot> D;

creating a new item for the item set in the next parser state.

All the items in the current set are processed for each token: if the token allows the parser to "shift" over the next rule element, the item with a revised dot is placed in the set for the next state; otherwise the item is no longer a valid interpretation of the input and is discarded (i.e., isn't placed in the next set). As the dot moves, it indicates that new input is possible (for rule R above, D is now possible), and rules that allow a D to be recognized are added to the new state with the dot at the beginning of the rule. This may add several new rules to the set.
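A sketch of those two moves, assuming items shaped as above, states as plain arrays of items, a cfg object like the one in the question, and a hypothetical addIfNew helper that appends an item only if an equal one isn't already present:

// "Scan": items whose dot sits before the matching token type are
// copied into the next state with the dot advanced; everything else
// is silently dropped (discarded as an interpretation).
function scan(state, token, nextState) {
    state.forEach(function (it) {
        if (it.rhs[it.dot] === token.type)
            nextState.push({ lhs: it.lhs, rhs: it.rhs,
                             dot: it.dot + 1, origin: it.origin });
    });
}

// "Predict": if the symbol after a dot is a nonterminal, add all of
// its rules to the current state with the dot at the beginning.
// (Items added mid-loop are themselves visited, so this closes over
// chains of nonterminals.)
function predict(state, stateIndex, cfg) {
    for (var i = 0; i < state.length; i++) {
        var sym = state[i].rhs[state[i].dot];
        if (cfg[sym]) {                              // nonterminal?
            cfg[sym].forEach(function (rhs) {
                addIfNew(state, { lhs: sym, rhs: rhs,
                                  dot: 0, origin: stateIndex });
            });
        }
    }
}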

When a dot ends up at the far end:

 R = A B C D <dot> ;

then in effect R has been seen as a nonterminal (this is called a "reduction" to R), and it can be used to advance the dot in other items that mention R (in an Earley parser, those waiting items live in the state where R's rule started):

 P = Q <dot> R S ;

transitions to P = Q R <dot> S ;

Now, this process is applied to all items (rule+dot) in the current state as tokens are processed.
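A sketch of the completion step, with addIfNew as before; states is the array of all item sets built so far, which is why each item remembers its origin:

// "Complete": the dot reached the end of done's rule, so advance
// every item in the originating state that was waiting for done.lhs.
function complete(state, done, states) {
    states[done.origin].forEach(function (it) {
        if (it.rhs[it.dot] === done.lhs)
            addIfNew(state, { lhs: it.lhs, rhs: it.rhs,
                              dot: it.dot + 1, origin: it.origin });
    });
}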

The parser starts in a first state with a one-element set consisting of the goal (what you called the "start symbol") rule with a dot indicating that no part of the rule has been consumed, in your case:

{  primary_exp = <dot> id ; }

A little thought will let you conclude that the goal rule always stays in the item set with its dot somewhere. Parsing is complete when the dot in the goal rule falls off the end of the goal rule, e.g., when the goal rule reduces to the goal token, and the input stream is entirely consumed.
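Gluing the steps together (still a sketch; closure is a hypothetical helper that runs predict and complete over a state until it stops growing):

function parse(tokens, cfg, goal) {
    var states = [[]];
    cfg[goal].forEach(function (rhs) {
        states[0].push({ lhs: goal, rhs: rhs, dot: 0, origin: 0 });
    });
    for (var i = 0; i <= tokens.length; i++) {
        closure(states, i, cfg);              // predict + complete to a fixpoint
        if (i < tokens.length) {
            states.push([]);
            scan(states[i], tokens[i], states[i + 1]);
        }
    }
    // Accept iff some goal item's dot fell off the end in the final
    // state *and* every token has been consumed.
    return states[tokens.length].some(function (it) {
        return it.lhs === goal && it.dot === it.rhs.length && it.origin === 0;
    });
}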

Earley parsers are relatively fast, and are very general; they will parse any context-free language. That's surprisingly powerful. (If you understand how an Earley parser works with items, you understand most of what you need to know to understand how LR parsers work). And they are pretty easy to build.

The Wikipedia site has an example worked in more detail.

As to building trees with an Earley parser (or any similar type of parser), whenever a reduction to R takes place, you can build a tree node whose root is type R and whose children are the trees previously built for its elements.

Obviously, when processing a terminal token t, one builds a unit tree for t. [It is easy to understand why when you realize your lexer is actually a sub-parser, which is "reducing" strings of characters to the terminal token. You could have put the lexer rules into the grammar operating on character terminal tokens; you just chose not to, for efficiency reasons. You could do this for fun; the Earley parser will do just fine but will run pretty slowly because now it is doing all that item set management on a larger set of rules, on a per-character basis.].
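One way to do that bookkeeping is to give each item a kids array holding the subtree for every symbol its dot has crossed (the field name is mine; any name works):

// Crossing a terminal attaches a unit tree (a leaf) for the token.
function advanceOverToken(it, token) {
    return { lhs: it.lhs, rhs: it.rhs, dot: it.dot + 1, origin: it.origin,
             kids: it.kids.concat([{ type: token.type, value: token.value }]) };
}

// A completed item reduces to a node whose children are its own kids;
// crossing the reduced nonterminal attaches that node as a subtree.
function advanceOverReduction(it, done) {
    var subtree = { type: done.lhs, children: done.kids };
    return { lhs: it.lhs, rhs: it.rhs, dot: it.dot + 1, origin: it.origin,
             kids: it.kids.concat([subtree]) };
}

When the goal rule itself completes, its node is the AST for the whole input.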

Keeping track of all this stuff while parsing seems a bit tricky, but is not actually that hard. I leave it to the reader.

For contrast, see how to do all this parsing and tree building using hand-coded recursive-descent parsing. (Such parsers aren't as powerful; in particular, they can have a hard time with left-recursive grammar rules, but they are really easy to write if you have a grammar.)
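For the small grammar in the question (which has no left recursion), a hand-coded version might look like this; each function tries to match its nonterminal starting at pos and returns { node, pos } on success or null on failure:

function parseUnaryExp(tokens, pos) {
    var t = tokens[pos];
    if (t && (t.type === "++" || t.type === "--")) {
        var sub = parseUnaryExp(tokens, pos + 1);
        return sub && { node: { type: "unary_exp", op: t.type,
                                children: [sub.node] },
                        pos: sub.pos };
    }
    return parsePrimaryExp(tokens, pos);       // unary_exp -> primary_exp
}

function parsePrimaryExp(tokens, pos) {
    var t = tokens[pos];
    if (t && t.type === "id")
        return { node: { type: "primary_exp", value: t.value }, pos: pos + 1 };
    return null;
}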

Ira Baxter
  • Thanks for the answer! What if there are multiple *goals* to begin with? For instance, in C, an AST could be formed from `int i = 0;` as well as with `i = 1;`, however the starting symbol/goal for the first one is a type specifier, while for the second it's an id. – Purag Jul 26 '15 at 09:15
  • Here is an Earley Parser I wrote in Python, for reference. https://github.com/TheArtOfEngineering/NLPParser/blob/master/parse2.py – Tyler Cloutier Jul 26 '15 at 09:16
  • @Purag: You simply put your multiple goal rules into the starting set. Parsing succeeds if *any* of the goal rules reduce, *and* the input is entirely consumed. Note the *and* condition. – Ira Baxter Jul 26 '15 at 09:16
  • Ah, so a *goal* isn't the first symbol matched, it's the rule for the complete statement. As in, one entire translatable unit. We read in `int` and say "hey, move the dot forward for each rule that starts with `type_specifier`," if I'm understanding correctly? – Purag Jul 26 '15 at 09:21
  • And for a complicated language like C, would there be an issue (besides performance hit) with including all the rules in the initial set? – Purag Jul 26 '15 at 09:24
  • Technically it will work. Yes, you'll take quite a performance hit on the first few state transitions, but after that, the item sets will shrink to match what the input actually forces pretty fast. We do something like this to do *pattern* parsing in our DMS system (see my bio) to try to match a code fragment to the part of the language it represents (and we even do this for C!). We don't use an Earley parser because the costs are pretty high for parsing really big files; we use GLR instead, but GLR parsers are *much* harder to implement. (They pay off in performance very nicely). – Ira Baxter Jul 26 '15 at 09:29
  • "goal isn't the first symbol matched" ... Right. A goal rule is what you decide the goal rule to be. People parsing full source files think the goal rule is the "compilation unit", but you can choose the goal rule to be any or all of the rules in your grammar, *if* you remember to implement the *and* empty-input-stream check. (If a rule reduces and the input stream still has data, then you aren't don't parsing yet). – Ira Baxter Jul 26 '15 at 09:33
  • Your latest edit was gold. I was sitting here wondering "but why can't I just replace my lexer with terminal rules in the CFG?" My tokenizer actually sucks right now, I need to optimize it, but I'm very, very interested in parsing everything in one pass instead of two. – Purag Jul 26 '15 at 09:52
  • @Purag: Well, if you hate your lexer you can always write a better one. But you don't have to have two passes just because you hate it :-} In any parser there is clearly a point where a next input token is needed (either a terminal, or a character if you go nanoscopic). At that point, you can simply call the lexer to get the next token. Everything then happens in one pass. – Ira Baxter Jul 26 '15 at 09:58
  • "goal isn't ..." comment above needs a patch: *aren't don't* should have been *aren't done*. (Fastest fumble in the West). – Ira Baxter Jul 26 '15 at 10:06
  • Hey Ira, thought I'd update you...I was finally able to put together the Earley parser! [Check it out](https://github.com/purmou/pearley). The only thing is that I haven't resolved the issue of empty productions yet, but I'm working on it. :) – Purag Jul 29 '15 at 16:24
  • Congratulations. Yes, empty productions are fun, but don't change the basic idea. What did you do about the lexer? – Ira Baxter Jul 29 '15 at 16:36
  • @IraBaxter: Haven't worked on it yet; I think I have an idea that will improve it, but we'll see. For now, my parser now accepts grammars in BNF, so all I need to do is write an `eval()` function to go through the parse tree and generate a compatible grammar for my parser, and I'll have a parser generator. :) [Check it out here](https://jsfiddle.net/purmou/a82kr75n/11/embedded/result/) – Purag Aug 05 '15 at 22:53
  • @IraBaxter Well, turns out I don't need to change my tokenizer, it's plenty fast. After doing some CPU profiling I learned that what was really holding up the page was printing out the results. Haha. – Purag Aug 07 '15 at 15:41