
I am moving from a separate lexer and parser to a combined parser that operates on an array of characters. One problem is how to properly handle whitespace.

Problem

Take a language consisting of a sequence of the characters 'a' and 'b'. Whitespace is allowed in the input but does not affect the meaning of the program.

My current approach to parse such a language is:

var token = function(p) {
    return attempt(next(
         parse.many(whitespace),
         p));
};

var a = token(char('a'));
var b = token(char('b'));

var prog = many(either(a, b));

This works, but requires unnecessary backtracking. For a program such as '___ba_a_b' (whitespace was getting stripped in the post, so let _ be a space), in matching 'b' the whitespace is parsed twice: first for `a` and, when `a` fails, again for `b`. Simply removing `attempt` does not work, as `either` will never reach `b` if any whitespace has been consumed.
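To see the duplicated work concretely, here is a standalone toy version of the combinators above (the parser representation and all names are my own illustration, not the question's actual library), instrumented to count how often the whitespace rule runs:

```javascript
// A parser is a function (input, pos) -> { pos, value } on success, or
// null on failure. Returning null leaves the caller's position untouched,
// which models attempt's backtracking.
let wsRuns = 0; // how many times the whitespace rule has run

const char = (c) => (s, i) =>
  s[i] === c ? { pos: i + 1, value: c } : null;

const manyWhitespace = (s, i) => {
  wsRuns++;
  while (s[i] === " ") i++;
  return { pos: i, value: null };
};

// token(p) = attempt(next(many(whitespace), p)): skip leading whitespace,
// then try p; on failure the whole thing backtracks (returns null).
const token = (p) => (s, i) => p(s, manyWhitespace(s, i).pos);

const either = (p, q) => (s, i) => p(s, i) || q(s, i);

const many = (p) => (s, i) => {
  const out = [];
  for (let r; (r = p(s, i)); i = r.pos) out.push(r.value);
  return { pos: i, value: out };
};

const a = token(char("a"));
const b = token(char("b"));
const prog = many(either(a, b));

console.log(prog("   ba a b", 0).value.join("")); // "baab"
console.log(wsRuns); // 8: whitespace was re-scanned every time 'a' failed before 'b' (and at end of input)
```

With only two alternatives the waste is small, but it grows with the number of alternatives and the length of each whitespace run.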

Attempts

My first thought was to move the `token` call outside of the `either`:

var a = char('a');
var b = char('b');

var prog = many(token(either(a, b)));

This works, but now prog cannot be reused easily. In building a parser library, this seems to require defining parsers twice: one version that actually consumes 'a' or 'b' and can be used in other parsers, and one version that correctly handles whitespace. It also clutters up parser definitions by requiring them to have explicit knowledge of how each parser operates and how it handles whitespace.
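The reuse cost can be demonstrated with a standalone toy sketch (parser shape and names are my own illustration, not the question's library): once whitespace handling is attached only at the `prog` level, the bare `a` and `b` cannot be dropped into any other production without re-wrapping them in `token` at every call site.

```javascript
// A parser is (input, pos) -> { pos, value } on success, null on failure.
const char = (c) => (s, i) =>
  s[i] === c ? { pos: i + 1, value: c } : null;

const skipWs = (s, i) => {
  while (s[i] === " ") i++;
  return i;
};

// skip leading whitespace, then run p
const token = (p) => (s, i) => p(s, skipWs(s, i));

// next(p, q): run p, then q on the remaining input
const next = (p, q) => (s, i) => {
  const r = p(s, i);
  return r && q(s, r.pos);
};

const a = char("a");
const b = char("b");

// The bare parsers compose, but know nothing about whitespace:
console.log(next(a, b)("a b", 0)); // null
// So every composite production must re-apply token itself:
console.log(next(token(a), token(b))("a b", 0)); // { pos: 3, value: 'b' }
```

Every production ends up encoding the whitespace policy, which is exactly the clutter described above.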

Question

The intended behavior is that an arbitrary amount of whitespace can be consumed before a token. If parsing the token fails, it backtracks to the start of the token instead of the start of the whitespace.

How can this be expressed without preprocessing the input to produce a token stream? Are there any good, real-world code examples of this? The closest I found was Higher-Order Functions for Parsing, but it does not seem to address my specific concern.

Matt Bierner
    So why are you moving away from having a separate lexer? – 500 - Internal Server Error Sep 13 '13 at 20:56
  • I am working with ECMAScript. During lexing, the '/' character is ambiguous: it can be a division symbol or part of a regular expression literal. The only way to determine what '/' means is from the symbol's context during parsing. This requires some very ugly hacks with a separate lexer and parser. [Levi's answer to this question](http://stackoverflow.com/questions/18214179/combining-lexer-and-parser-in-a-parser-combinator) led me to believe that combining the lexer and parser may be a good approach. Any alternative designs would be welcome as well. – Matt Bierner Sep 14 '13 at 04:08
  • How is this different from, e.g. `<` being either (part of) a relational operator or else the opening of a template expression in c-style languages? Normally you should be able to defer any decisions about what a character is that would introduce ambiguity from the lexer stage to the parsing stage. – 500 - Internal Server Error Sep 15 '13 at 04:01
  • That is an example of using the same token in different productions, although I'm not sure how C++11 handles expressions like `vector<vector<int>>` where `>>` appears. Both relational expressions and template expressions operate on `<` punctuator tokens. In JavaScript, the '/' can be lexed to either a div punctuator token or, taken with some other characters, to a regexp token. Because characters like `"` can appear inside of regular expressions, I do not think lexing to div punctuators and introducing a regular expression production using div punctuators would work. – Matt Bierner Sep 16 '13 at 13:39
  • I was going to suggest Higher-Order Functions for Parsing but you have been there. Since you don't list the grammar, is it really two grammars with one being an island grammar? See: http://stackoverflow.com/questions/2561249/island-grammar-antlr3 – Guy Coder Dec 01 '13 at 16:57
  • @MattBierner it's even worse in JS... `/` can be a division, the start or the end of a RegEx, or denote, as `//`, a line comment. for this reason, the empty RegEx cannot be written as `//`, and RegEx literals cannot start with an unescaped space character... – flow Jun 17 '14 at 16:50

1 Answer


I solved this problem in a JSON parser that I built. Here are the key parts of my solution, in which I somewhat followed the 'write the parsers twice' approach:

  1. define the basic parsers for each token -- number, string, etc.

  2. define a token combinator that takes a basic token parser and outputs a whitespace-munching parser. The munching should occur after the token so that whitespace is only parsed once, as you noted:

    function token(parser) {
        // run the parser, then munch whitespace
    }
    
  3. use the token combinator to produce the munching token parsers

  4. use the parsers from 3. to build the parsers for the rest of the language
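Applied to the 'a'/'b' language from the question, the four steps can be sketched as a standalone toy implementation (the parser shape and names are illustrative assumptions, not the actual JSON parser code), instrumented to show that each run of whitespace is now scanned exactly once:

```javascript
// A parser is (input, pos) -> { pos, value } on success, null on failure.
let wsRuns = 0; // counts runs of the whitespace rule

const skipWs = (s, i) => {
  wsRuns++;
  while (s[i] === " ") i++;
  return i;
};

// 1. basic parsers for each token
const char = (c) => (s, i) =>
  s[i] === c ? { pos: i + 1, value: c } : null;

// 2. token combinator: run the parser, THEN munch trailing whitespace
const token = (p) => (s, i) => {
  const r = p(s, i);
  return r && { pos: skipWs(s, r.pos), value: r.value };
};

// 3. whitespace-munching token parsers
const a = token(char("a"));
const b = token(char("b"));

// 4. the rest of the language, built only from the munching parsers
const either = (p, q) => (s, i) => p(s, i) || q(s, i);
const many = (p) => (s, i) => {
  const out = [];
  for (let r; (r = p(s, i)); i = r.pos) out.push(r.value);
  return { pos: i, value: out };
};
// one extra skip at the very start of the input
const prog = (s, i) => many(either(a, b))(s, skipWs(s, i));

console.log(prog("   ba a b", 0).value.join("")); // "baab"
console.log(wsRuns); // 5: once up front, then once after each of the 4 tokens
```

Because a failing token parser fails before it ever reaches the trailing whitespace, `either` backtracks only over the token itself, which is the behavior asked for in the question.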

Having two similar versions of each parser wasn't a problem for me, since the two versions really do serve distinct purposes. It was slightly annoying, but a tiny cost. You can check out the full code here.

Matt Fenwick