How to parse indentation-driven code into AST

Question

This is the code I need to parse into an AST:

one
 two
  three
   four
 five
 six
  seven

It is indentation-driven, as you can see. I can't find a way to tell my parser (I'm using ANTLR4) that a leading space indicates a sub-level.

@yegor256 I updated the answer I linked to with an ANTLR4 example. — Bart Kiers, Dec 05 '16 at 08:01

score 3 · Accepted Answer · answered Dec 02 '16 at 16:18

Basically you can't explain it to the parser without some help from the lexer.

Instead, what you do is hack the lexer to keep track of the number of spaces that start a line as it scans across the spaces. If the space count changes from the previous line, the lexer emits a token. If the count goes up, emit an INDENT token. If the count goes down, emit a DEDENT token.
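A minimal sketch of that idea in plain Python rather than ANTLR, so it shows the technique and not ANTLR's lexer API. The `tokenize` function, the `NAME` token kind, and the `(kind, text)` token shape are assumptions made for illustration. It keeps a stack of indentation widths so that a line dropping several levels at once emits one DEDENT per closed level, and it closes any open levels at end of input.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, text); kinds here are NAME, INDENT, DEDENT


def tokenize(source: str) -> List[Token]:
    """Turn changes in leading-space count into INDENT/DEDENT tokens."""
    tokens: List[Token] = []
    stack = [0]  # indentation widths of the currently open levels
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines do not change the nesting level
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            # Deeper than the previous line: open one new level.
            stack.append(width)
            tokens.append(("INDENT", ""))
        while width < stack[-1]:
            # Shallower: close every level this line has left.
            stack.pop()
            tokens.append(("DEDENT", ""))
        if width != stack[-1]:
            raise ValueError(f"inconsistent dedent: {line!r}")
        tokens.append(("NAME", line.strip()))
    # Close the levels that are still open at end of input.
    while len(stack) > 1:
        stack.pop()
        tokens.append(("DEDENT", ""))
    return tokens
```

For the sample in the question, the line "five" follows "four" two levels up, so two DEDENT tokens are emitted before it.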

Now you can add INDENT and DEDENT tokens into the parser rules. They act logically like { and } in C-like languages.
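To illustrate how the parser side might look, here is a small hand-written recursive-descent sketch in Python (not ANTLR-generated code). It assumes the lexer already delivers `(kind, text)` pairs with the kinds NAME, INDENT and DEDENT; the `Node` class and `parse` function are hypothetical names. The rule it implements is roughly `node : NAME (INDENT node+ DEDENT)?`, which is the same shape a brace-delimited block would have.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, text), as assumed to come from the lexer


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def parse(tokens: List[Token]) -> List[Node]:
    """Build a tree from NAME/INDENT/DEDENT tokens."""
    pos = 0

    def node() -> Node:
        nonlocal pos
        kind, text = tokens[pos]
        if kind != "NAME":
            raise SyntaxError(f"expected NAME, got {kind}")
        pos += 1
        result = Node(text)
        if pos < len(tokens) and tokens[pos][0] == "INDENT":
            pos += 1  # consume INDENT, the counterpart of '{'
            while pos < len(tokens) and tokens[pos][0] != "DEDENT":
                result.children.append(node())
            if pos >= len(tokens):
                raise SyntaxError("missing DEDENT")
            pos += 1  # consume DEDENT, the counterpart of '}'
        return result

    roots: List[Node] = []
    while pos < len(tokens):
        roots.append(node())
    return roots
```

Fed the token stream for the sample in the question, this yields a single root "one" with children "two", "five" and "six", where "two" contains "three", "three" contains "four", and "six" contains "seven".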

Ira Baxter

That's what I thought. Can you please show what exactly that "hack the lexer" would look like? Maybe point me to the right blog post or documentation page? – yegor256 Dec 02 '16 at 16:39
Here I can't help you; I'm not an ANTLR expert. (This is the hack we did for our Python parser with our peculiar lexer, but it is a pretty standard stunt.) You'll have to dig into the details of writing an ANTLR lexer. – Ira Baxter Dec 02 '16 at 16:55
Hi @yegor256, was your problem solved? You can look at the lexer generated from the Python3 grammar, or just at the lexer members section of the Python3 grammar. It could even be a simple Python program that handles INDENT/DEDENT and emits the token when it encounters a change. – Mar 14 '17 at 10:48
