I am moving from a separate lexer and parser to a combined parser that operates on an array of characters. One problem is how to properly handle whitespace.
Problem
Take a language consisting of a sequence of the characters 'a' and 'b'. Whitespace is allowed in the input but does not affect the meaning of the program.
My current approach to parsing such a language is:
var token = function(p) {
    // Skip any leading whitespace, then run p. The attempt wrapper
    // makes a failure of p backtrack past the whitespace as well.
    return attempt(next(
        many(whitespace),
        p));
};
var a = token(char('a'));
var b = token(char('b'));
var prog = many(either(a, b));
This works, but requires unnecessary backtracking. For a program such as '___ba_ab' (whitespace was getting stripped in the post, so let _ stand for a space), the whitespace before the first 'b' is parsed twice: first by a, and again by b after a fails. Simply removing attempt does not work, because either will never try b once a has consumed any whitespace.
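Inlining token into a and b makes the redundant work explicit; both alternatives begin by parsing the same run of whitespace:
var prog = many(either(
    attempt(next(many(whitespace), char('a'))),   // consumes '___', fails on 'b', backtracks
    attempt(next(many(whitespace), char('b'))))); // re-parses '___' before matching 'b'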
Attempts
My first thought was to move the token call outside of the either:
var a = char('a');
var b = char('b');
var prog = many(token(either(a, b)));
This works, but now prog cannot be reused easily. In building a parser library, this seems to require defining every parser twice: one version that actually consumes 'a' or 'b' and can be composed into other parsers, and one version that correctly handles whitespace. It also clutters up parser definitions by requiring them to have explicit knowledge of how each parser operates and how it handles whitespace.
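For example, keeping both variants around would look something like this (aTok and bTok are hypothetical names for the token-wrapped versions):
var a = char('a');                   // raw: safe to compose into larger parsers
var b = char('b');
var aTok = token(a);                 // whitespace-aware variant
var bTok = token(b);
var prog = many(either(aTok, bTok)); // but this reintroduces the double parsing of whitespace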
Question
The intended behavior is that an arbitrary amount of whitespace can be consumed before a token. If parsing the token itself fails, the parser should backtrack to the start of the token rather than to the start of the whitespace.
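Conceptually, the backtracking point should sit between the whitespace and the token. A sketch of the shape I am after (note that this alone still breaks the either above, since the whitespace has already been consumed by the time p fails):
var token = function(p) {
    return next(
        many(whitespace), // whitespace is consumed once and committed to
        attempt(p));      // failure backtracks only to the start of the token
};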
How can this be expressed without preprocessing the input to produce a token stream? Are there any good, real-world code examples of this? The closest I found was Higher-Order Functions for Parsing, but it does not seem to address my specific concern.