Has anyone got a simple example of how to define a grammar that parses python-like indentation for blocks using Jison?
- Hi! [This question and its answers](http://stackoverflow.com/questions/1547944/how-do-i-parse-indents-and-dedents-with-pyparsing) may help you a lot, I think. – Grijesh Chauhan Feb 12 '13 at 06:53
- Although, unless Jison has the same feature as pyparsing, perhaps it doesn't really answer the question. – interstar Feb 13 '13 at 18:10
- Looking for the same thing. – Clearly Feb 13 '13 at 22:29
- CoffeeScript is one of these. In its beginnings, it had a token named `INDENT`, but I don't really understand their grammar now. – fiatjaf Oct 30 '14 at 11:04
1 Answer
I created a language using Jison which uses python-style indentation. It's an automated white-box algorithm testing language called Bianca.
Bianca only has two dependencies - one is Jison and the other one is Lexer. Jison supports custom scanners and Lexer is one such scanner.
In C-style programming languages, blocks of code are delimited by curly braces. With Python-style indentation, however, you have `INDENT` and `DEDENT` tokens.
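To make the role of these two tokens concrete, here is a rough sketch of a grammar written in the plain-object form that Jison accepts. Only `INDENT` and `DEDENT` come from this answer; the other token names (`NAME`, `IF`, `COLON`, `NEWLINE`, `EOF`) and all rule names are made up for illustration, and this is not Bianca's actual grammar.

```javascript
// Hypothetical grammar sketch: a block is whatever sits between INDENT and DEDENT,
// exactly where a C-style grammar would use "{" and "}".
var grammar = {
    bnf: {
        program: [["statements EOF", "return $1;"]],
        statements: [
            ["statement", "$$ = [$1];"],
            ["statements statement", "$$ = $1.concat([$2]);"]
        ],
        statement: [
            // a bare name on its own line (illustrative only)
            ["NAME NEWLINE", "$$ = { type: 'name', value: $1 };"],
            // "if x:" followed by an indented block
            ["IF NAME COLON NEWLINE block", "$$ = { type: 'if', test: $2, body: $5 };"]
        ],
        block: [["INDENT statements DEDENT", "$$ = $2;"]]
    }
};
```

The grammar itself never looks at whitespace. All of the indentation handling lives in the scanner, which is why the rest of this answer is about the lexer.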
Writing a rule to generate `INDENT` and `DEDENT` tokens in Lexer is brain-dead simple. In fact, the Lexer documentation shows precisely how to do it.
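This snippet of code is taken directly from the source code of Bianca (lexer.js):

    // Stack of indentation widths; the innermost level is at index 0.
    var indent = [0];

    // Match the leading spaces of every line (`lexer` and `col` are defined elsewhere in lexer.js).
    lexer.addRule(/^ */gm, function (lexeme) {
        var indentation = lexeme.length;
        col += indentation;

        // Deeper than the current level: open a new block.
        if (indentation > indent[0]) {
            indent.unshift(indentation);
            return "INDENT";
        }

        // Shallower: close one block per level popped off the stack.
        var tokens = [];
        while (indentation < indent[0]) {
            tokens.push("DEDENT");
            indent.shift();
        }
        if (tokens.length) return tokens;
    });
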
A brief explanation of how this code works can be found in the Python documentation:
Before the first line of the file is read, a single zero is pushed on the stack; this will never be popped off again. The numbers pushed on the stack will always be strictly increasing from bottom to top. At the beginning of each logical line, the line's indentation level is compared to the top of the stack. If it is equal, nothing happens. If it is larger, it is pushed on the stack, and one `INDENT` token is generated. If it is smaller, it must be one of the numbers occurring on the stack; all numbers on the stack that are larger are popped off, and for each number popped off a `DEDENT` token is generated. At the end of the file, a `DEDENT` token is generated for each number remaining on the stack that is larger than zero.
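
If you want to play with the algorithm quoted above without pulling in any library, here is a minimal standalone sketch. It is not Bianca's code: `createIndentScanner`, `tokenize` and the placeholder `LINE` token are names I made up for illustration, and it tokenizes the whole input up front instead of streaming. The `setInput`/`lex` pair mirrors the two methods that, as far as I understand the Jison docs, a custom scanner is expected to provide; a real scanner would also set things like `yytext`.

```javascript
// Hypothetical standalone scanner that only tracks indentation.
function createIndentScanner() {
    var indent = [0];   // indentation stack; the bottom 0 is never popped
    var queue = [];     // tokens waiting to be handed out by lex()

    return {
        setInput: function (source) {
            indent = [0];
            queue = [];
            source.split("\n").forEach(function (line) {
                if (line.trim() === "") return;            // blank lines do not open or close blocks
                var width = line.match(/^ */)[0].length;   // count leading spaces only
                if (width > indent[0]) {
                    indent.unshift(width);
                    queue.push("INDENT");
                }
                while (width < indent[0]) {
                    indent.shift();
                    queue.push("DEDENT");
                }
                // A dedent must land on a width that is already on the stack.
                if (width !== indent[0]) throw new Error("inconsistent dedent");
                queue.push("LINE");                        // stand-in for the line's real tokens
            });
            while (indent[0] > 0) {                        // close blocks still open at end of input
                indent.shift();
                queue.push("DEDENT");
            }
            queue.push("EOF");
        },
        lex: function () {
            return queue.shift();
        }
    };
}

// Small helper that drains the scanner into an array (EOF excluded).
function tokenize(source) {
    var scanner = createIndentScanner();
    scanner.setInput(source);
    var tokens = [], token;
    while ((token = scanner.lex()) !== "EOF") tokens.push(token);
    return tokens;
}
```

For example, `tokenize("a\n  b\n    c\nd")` yields one `INDENT` before each of the two nested lines and two `DEDENT` tokens before the last line. Unlike the Lexer rule above, this sketch also emits the end-of-file `DEDENT` tokens described in the last sentence of the quote.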

- I thought Jison couldn't use `^` for "starts with". Looks like you can add the rule in the Jison file and use this to get the lexeme-related stuff. – Justin Thomas Jun 26 '14 at 01:49
- How do you do this with the standard lex file? What is `col`? I'm trying to log (this) on the rule and figure out where all those values came from. – Justin Thomas Jun 28 '14 at 20:39
- All of those values are defined in [lexer.js](https://github.com/aaditmshah/bianca/blob/master/lib/lexer.js). – Aadit M Shah Jun 29 '14 at 04:01
- You can't. Jison [doesn't currently support the beginning of line identifier](https://github.com/zaach/jison/issues/67 "Line beginning identifier (^) not work. · Issue #67 · zaach/jison"). Hence I wrote a [custom scanner](http://zaach.github.com/jison/docs/#custom-scanners "Jison/Documentation") ([Lexer](https://github.com/aaditmshah/lexer)) to solve this problem. – Aadit M Shah Jun 30 '14 at 03:56