2

So in my < 24 hours of bison/flex investigation I've seen lots of documentation that indicates that left recursion is better than right recursion. Some places even mention that with left recursion, you need constant space on the Bison parser stack whereas right recursion requires order N space. However, I can't quite find any sources that explains what is going on explicitly.

As an example (parser that only adds and subtracts):

Scanner:

%%
[0-9]+ {return NUMBER;}
%%

Parser:

%%
/* Left */
expression:
    NUMBER
  | expression '+' NUMBER { $$ = $1 + $3; }
  | expression '-' NUMBER { $$ = $1 - $3; }
  ;
/* Right */
expression:
    NUMBER
  | NUMBER '+' expression { $$ = $1 + $3; }
  | NUMBER '-' expression { $$ = $1 - $3; }
  ;
%%

For the example of 1+5-2, it seems with left recursion, the parser receives '1' from the lexer and sees that '1' matches expression: NUMBER and pushes an expression of value 1 to the parser stack. It sees + and pushes. Then it sees 5 and the expression(1), + and 5 matches expression: expression '+' NUMBER so it pops twice, does the math and pushes a new expression with value 6 on the stack and then repeats for the subtraction. At any one point, there is at max, 3 symbols on the stack. So it's like an in-place calculation and operates left to right.

With right recursion, I'm not sure why it has to load all symbols on the stack but I'm going to attempt at describing why that might be the case. It sees a 1 and matches expression: NUMBER so it pushes an expression with value 1 on the stack. It pushes '+' on the stack. When it sees the 5, my first thought is that 5 on it's own could match expression: NUMBER and hence be an expression of value 5 and then it plus the last two symbols on the stack could match expression: NUMBER '+' expression but my assumption is that because expression is on the right of the rule, that it can't jump the gun and evaluate 5 as an expression as a NUMBER since with LALR(1), it already knows more symbols are coming so it has to wait until it hits the end of the list?

TL;DR;

Can someone maybe explain with some detail how Bison manages its parse stack relative to how it does its shift/reduction with the parser grammar rules? Silly/contrived examples welcome!

jshort
  • 1,006
  • 8
  • 23

2 Answers2

7

With LR (bottom-up) parsing, each non-terminal is reduced precisely when its last token is encountered. (LALR parsing is a simplified LR parse which handles lookahead slightly less precisely.) Until a non-terminal is reduced, all its components live on the stack. So if you use right recursion and you are parsing

NUMBER + NUMBER + NUMBER + NUMBER

the reductions won't start until you get to the end, because each NUMBER starts an expression and all the expressions end at the last NUMBER.

If you use left recursion, each NUMBER terminates an expression, so a reduction happens each time a NUMBER is encountered.

That's not the reason to use left-recursion though. You use left-recursion because it accurately describes the language. If you have 7 - 2 - 1, you want the result to be 4, because that's what algebraic rules require: the expression is parsed as though it were (7 - 2) - 1, so 7 - 2 must be reduced first. With right-recursion, you would incorrectly evsluate that as 6, because the 2 - 1 would reduce first.

Most operators associate to the left, so you use left-recursion. For the occasional operator which associates to the right, you need right recursion and you have to live with the stack growing. It's not a big deal. Your machine has tons of memory.

As an example, consider assignment. a = b = 42 means a = (b = 42). If you did it left associatively, you'd first set a to b, and then attempt to set something to 42; (a = b) = 42 wouldn't make sense in most languages and it is certainly not the expected action.


LL (topdown) parsing uses lookahead to predict which production will be reduced. It can't handle left-recursion at all because the prediction ends up in a recursive loop: expression starts with an expression which starts with an expression … and the parser never manages to predict a NUMBER. So with LL parsers, you have to use right-recursion and then your grammar does not correctly describe the language (assuming that the language has left-associative operators, as is usually the case). Some people don't mind this, but I think that grammars should actually indicate the correct parse, and I find the modification necessary to make a grammar parseable with a top-down parser to be messy and hard to read. Your mileage may vary.


By the way, "force down your throat" is a very ungenerous description of documentation which is trying to give you good advice. It is good to be skeptical -- you understand things better if you work at figuring out why they work the way they do -- but many people just want the good advice.

rici
  • 234,347
  • 28
  • 237
  • 341
  • Thanks rici. The 'documentation' I was referring to were tutorials on Flex/Bison and Stackoverflow posts. The gnu.org documentation does imply that left recursion is best due to limiting parse stack usage relative to right recursion but did not go into any conversation about associativity of your grammar etc. Regardless, sorry for the harsh language. – jshort Feb 04 '18 at 21:31
  • 1
    Here are a few relevant SO answers of varying complexity, in case you find them useful. There are others, too. https://stackoverflow.com/questions/35921723/print-tokens-properly-using-lex-and-yacc/35922100 --- https://stackoverflow.com/questions/38662506/bash-script-size-limitation/38662878 --- https://stackoverflow.com/questions/37444660/unclear-how-a-yacc-bison-production-spec-can-cause-a-stack-overflow/37450721 --- https://stackoverflow.com/questions/34496124/reduce-order-in-bison/34496871 – rici Feb 04 '18 at 21:49
1

So after reading this rather important page in the bison documentation:

https://www.gnu.org/software/bison/manual/html_node/Lookahead.html#Lookahead

combined with running with

%debug

and

yydebug = 1;

in my main()

I think I see exactly what is happening.

With left recursion, it sees a 1 and shifts it. The lookahead is now +. Then it determines that it can reduce the 1 to an expression via expression: NUMBER. So it pops and puts an expression on the stack. It shifts + and NUMBER(5) and then sees it can reduce via expression: expression '+' NUMBER and pops 3x and pushes a new expression(6). Basically, by using the lookahead and the rules, bison can determine if it needs to shift or reduce at any time as it reads the tokens. As this repeats, at most there are 3 symbols/groupings on the parse stack (for this simplified expression evaluation section).

With right recursion, it sees a 1 and shifts it. The lookahead is now +. The parser sees no reason to reduce 1 to an expression(1) because there is no rule that goes expression '+' so it just continues shifting each token until it gets to the end of the input. At this point, there is [NUMBER, +, NUMBER, -, NUMBER] on the stack so it sees that the most recent number can be reduced to expression(2) and then shifts that. Then the rules start getting applied (`expression: NUMBER '-' expression') etc etc.

So the key to my understanding is that Bison uses the lookahead token to make intelligent decisions about reducing now or just shifting based on the rules it has at its disposal.

jshort
  • 1,006
  • 8
  • 23
  • 1
    This is very close. The parset indeed uses lookahead to decide whether to shift or reduce. But the reduction can only happen exactly at the moment when the entire production right-hand side has been read, no sooner (obviously, no?) and no later. With right recursion, all the nested recursive expansions are bundled up at the end of the input. With left-recursion, they bunch tobthe left. The span of each production is independent of the parsing algorithm's ability to suss it out, and indeed there are different ways to do that, all of which do the same reductions at the same point in input. – rici Feb 04 '18 at 22:12