8

I am writing a parser for a pet project, and for educational purposes, would like to do it by hand instead of using a parser generator. Unfortunately, many online resources (and the compiler course I took at university) assume some sort of LL(k) grammar. I don't want to left factor the grammar though since left-recursive non-terminals such as

E ::= E * E
    | E / E
    | E + E
    | E - E
    | ...

that are possible in bison-like parser generators seem to greatly simplify the grammar.

What would be the simplest way to hand-parse such a grammar? My research so far tells me that recursive descent is not an option for reasons explained here, and that LR parsing using recursive ascent may be a good choice, though several sites pause to mention that it is non-intuitive.

Community
  • 1
  • 1
Matt Kline
  • 10,149
  • 7
  • 50
  • 87
  • Try left-recursive Packrat extensions (they're perfectly compatible with handwritten parsers too). It will look pretty much like recursive descent parsing, with a bit of boilerplate for memoisation. See details here: http://www.vpri.org/pdf/tr2007002_packrat.pdf – SK-logic Jul 29 '13 at 14:33
  • Check out my SO answer on how to hand-code parsers: http://stackoverflow.com/questions/2245962/is-there-an-alternative-for-flex-bison-that-is-usable-on-8-bit-embedded-systems/2336769#2336769 For "left recursion", there must always be an alternative rule; you hand code this by check the alternative first, and looping back if you find it. – Ira Baxter Jul 29 '13 at 19:22

2 Answers2

6

If you want to build a mostly recursive descent parser for a toy language whose only left recursion is in more-or-less standard expression syntax, then you should consider a Pratt parser (Java) (Javascript). Alternatively (and equivalently, as it turns out), you can use an operator precedence parser; the algorithm is well described in the Dragon book and there are lots of available examples using the shunting yard algorithm (but beware; many people who enthusiastically discover this algorithm and immediately write it up for their blog are far from reliable authorities.)


Note for loose parsing:

If you want to parse an expression grammar, and you don't care about operator precedence (for example, if you only need to syntax colour the code), you can easily reframe the expression grammar to avoid left-recursion.

The starting point is this, using * for the Kleene star, ? for optional, and ( ) for grouping:

expression: term (INFIX term)*
term: PREFIX* operand postfix*
operand: ID
       | CONSTANT 
       | '(' expression ')'
       ;
postfix: POSTFIX
       | '[' expression ']'
       | '(' ( expression (',' expression)* )? ')' 
       ;

Recursive descent parsers can easily deal with the * and ? operators in the above, and there is no left-recursion, so that should suffice.

Note that C has the awkwardness of cast expressions, which are syntactically indistinguishable from function calls unless you know that the first parenthesized expression contains a type instead of an expression. However, for loose parsing purposes, it's fine to parse them as though they were function calls, using the above grammar.

C++'s use of angle brackets both as operators and as template parameters is harder to deal with. Many syntax colourers I've seen just ignore the template parameter case altogether; getting them right is a challenge, but might be worthwhile.


Editorial comment:

Ignore this if you like, but I personally think that learning to work with bottom-up LR parsers, particularly GLR parsers, is a far more enriching educational experience than trying to shoehorn a language into a restricted parsing algorithm, particularly one in which the details of the grammar are obscured as code paths. But having said that, it may be valuable to do both a bison and hand-generated parser, if only to see how much more elegant and simple the bison parser can be.

rici
  • 234,347
  • 28
  • 237
  • 341
  • I'm actually doing this for a code beautifier similar to astyle or uncrustify, so the target language is C, followed by C++, Java, etc. Before you fall out of your chair and declare me insane (because a C++ parser is quite the project), it will be a very loose parse - I only need enough information to do indentation and spacing according to user-set rules. – Matt Kline Jul 29 '13 at 16:05
  • @slavik262: For educational purposes, I present also a non-left-recursive expression grammar. I still think you should seriously consider a parser generator, though. – rici Jul 29 '13 at 16:33
  • 1
    What you are going to have a bad time with is any kind of preprocessor conditional or macro that fills out a code block with missing bracket, e.,g. x = ( myrightparengeneratingmacro() ; – Ira Baxter Jul 29 '13 at 19:13
  • @IraBaxter: do you know a pretty-printer/syntax-colouriser which gets that sort of thing right? Me, I'd be happy with template angle-brackets being balanced and coloured. – rici Jul 29 '13 at 19:53
  • @rici: If all you want is balanced brackets, most modern IDEs (certainly even crufty old Emacs) will generally do that. If you want prettyprinters that actually know the language syntax, check out my bio; they even handle the preprocessor craziness to some degree. – Ira Baxter Jul 29 '13 at 19:56
  • @IraBaxter: Emacs handles ` x = ( myrightparengeneratingmacro() ;` correctly? That's a surprise. Vim certainly doesn't. (But I consider that use of the preprocessor unreasonable, so I don't really care, although I have been known to indulge.) What I was asking for was a colorizer that knows how to balance angle-brackets, which requires knowing when an >> is a shift and when it's two angle closes. (Not difficult if you can distinguish between less than and angle open, but that in itself is far from trivial, as you know.) – rici Jul 29 '13 at 20:18
  • No, in spite of being glorious in many ways, emacs doesn't handle macros hiding parentheses either. To solve the >> problem for C++, you're likely to need a real parser. – Ira Baxter Jul 30 '13 at 09:59
1

Recursive descent parsers are easy, once you understand that they are just the inverse (or is that the converse?) of BNF.

A recent chunk of one that I wrote:

/**
 * Parse an addition or subtraction expression.
 *
 * <p/><code><pre>
 * AdditiveExpression
 *     MultiplicativeExpression
 *     AdditiveExpression "+" MultiplicativeExpression
 *     AdditiveExpression "-" MultiplicativeExpression
 * </pre></code>
 *
 * @return an expression
 */
Expression parseAdditiveExpression()
    {
    Expression expr = parseMultiplicativeExpression();
    while (peek().getId() == Id.ADD || peek().getId() == Id.SUB)
        {
        expr = new RelOpExpression(expr, current(), parseMultiplicativeExpression());
        }
    return expr;
    }

/**
 * Parse a multiplication / division / modulo expression.
 *
 * <p/><code><pre>
 * MultiplicativeExpression
 *     PrefixExpression
 *     MultiplicativeExpression "*" PrefixExpression
 *     MultiplicativeExpression "/" PrefixExpression
 *     MultiplicativeExpression "%" PrefixExpression
 *     MultiplicativeExpression "/%" PrefixExpression
 * </pre></code>
 *
 * @return an expression
 */
Expression parseMultiplicativeExpression()
    {
    Expression expr = parsePrefixExpression();
    while (true)
        {
        switch (peek().getId())
            {
            case MUL:
            case DIV:
            case MOD:
            case DIVMOD:
                expr = new RelOpExpression(expr, current(), parsePrefixExpression());
                break;

            default:
                return expr;
            }
        }
    }

I know lots of people who swear by the tools that allow the automation of parser generation, but I personally enjoy understanding the details of how a language is lexed and parsed, so I never let the tools do this fun work for me.

Kuba hasn't forgotten Monica
  • 95,931
  • 16
  • 151
  • 313
cpurdy
  • 1,177
  • 5
  • 12