
I'm making an LR(1) parser, and I've run across performance bottlenecks in various places.

I'd like to try optimizing the data structures for the parser, but in order to do so, I need a rough idea of how many states, rules, and terminal symbols are reasonable for (possibly complicated) computer languages, like C++.

My guesses are that a typical grammar for a complicated language would have:

  • ≤ 100 terminal symbols
  • ≤ 50 symbols per production
  • ≤ 2,000 rules
  • ≤ 10,000 states

but I really don't know how correct they are.

Note that I assume each rule is of the form `nonterminal: symbol symbol symbol ...`, so a single compound "rule" that looks like `foo: (bar | baz)+` might actually consist of, say, 5 rules, instead of just 1 rule.
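
For example (the helper rule names here are purely illustrative), that compound rule could desugar into five flat rules:

```
foo      : foo_list
foo_list : foo_list foo_item
foo_list : foo_item
foo_item : bar
foo_item : baz
```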

Are they reasonable? If not, where can I find some numbers on these?

user541686
  • I'd suggest talking about some other language in your question, since I don't think C++ can be parsed by LR(1) at all. – Ben Voigt Jan 04 '13 at 04:52
  • @BenVoigt: I plan on making it a generalized LR parser (GLR), which I think can theoretically handle C++. If I understand correctly the problem with C++ is that the LR(1) grammar is ambiguous, not that it's nonexistent, right? So that should be fine. – user541686 Jan 04 '13 at 04:53
  • Where is the bottleneck? The lookup tables/sets should only cost `O(1)` for the next state. – leppie Jan 04 '13 at 04:56
  • @leppie: Unfortunately (or fortunately?) the bottleneck is in the *generation* of the parser (figuring out all the states), not in the actual parser. – user541686 Jan 04 '13 at 04:57
  • @Mehrdad: Ahhh, that makes a lot more sense. But generating the parser should only be a 'one-time' job. Personally, I would not worry. ;p – leppie Jan 04 '13 at 05:00
  • @leppie: Yeah I know, but the fact that it takes a long time for a relatively simple grammar (a very-stripped-down version of Python's grammar took ~4 seconds) is really bugging me (no pun intended), so I'd like to fix it if I can. :) – user541686 Jan 04 '13 at 05:01
  • @Mehrdad: Have you read: http://dickgrune.com/Books/PTAPG_1st_Edition/ ? – leppie Jan 04 '13 at 05:01
  • @leppie: Nope I haven't! I'll take a look at it, thanks for the pointer! :) – user541686 Jan 04 '13 at 05:02
  • @Mehrdad: 4 seconds is pretty long ;p Have you looked at the source code of Bison? IIRC, it is pretty simple. – leppie Jan 04 '13 at 05:02
  • Also this is a little irrelevant, but for posterity's sake, I just ran across [this page](http://binarysculpting.com/2012/02/04/computing-lr1-closure/) on how to compute closures for canonical LR(1) parsers, so I'll just put it in this comment here. :) – user541686 Jan 04 '13 at 05:02
  • If you are making a parser generator, it would be much better to not have limits like this. – brian beuning Jan 04 '13 at 05:03
  • @brianbeuning: Yeah I'm not sure... I'd like to have some numbers before making a decision, because it does make a difference for me how big these are in practice. – user541686 Jan 04 '13 at 05:10
  • I side with the other people saying that it's not really important since it's a one-time job. 4 seconds really isn't that long, especially if you compare with the time it takes to compile anything meaningful. – zneak Jan 04 '13 at 05:39
  • If your grammar is ambiguous, then by definition it isn't LR(1). Secondly, LR(1) grammar tables tend to be a LOT bigger than LALR(1) tables because every context is treated uniquely, but you don't get much additional power in recognizing practical languages. Third, if you have a GLR parser, you can manage with LR(0) parse tables; you don't need the lookaheads. As others point out, it is irrelevant if the parser generator takes a few seconds. YOU can't generate grammar revisions that fast. Last point: building a C++ grammar that matches what the compilers really do is a messy process. – Ira Baxter Jan 04 '13 at 05:47
  • @zneak: I'd still like to have an idea what I'm up against... I really don't have any numbers right now. – user541686 Jan 04 '13 at 05:49
  • @IraBaxter: That's a good point regarding the lookaheads, but the reason I haven't used LR(0) is that my LR(1) path is a lot faster than my GLR path. I'm switching to GLR when I actually hit a conflict. As for the 4 seconds... I could live with it in production, but debugging is another matter... but even then, I'd still like to learn what the practical limits are, if for nothing other than the sake of learning. Do you happen to have any rough numbers on these by any chance? – user541686 Jan 04 '13 at 05:50
  • I don't understand "I'm making an LR(1) parser" (headline) ... "switching to GLR (on) conflict". For debugging, one minute is faster than you can make and debug changes, so this just doesn't matter in practice, and you get to sip your coffee. I've provided table sizes, but they are relevant to L(AL)R(0), not LR(1). You'll pay a really awful space cost for LR(1). If you have GLR backup, you might as well use L(AL)R(1). You still get the statistical advantage of LR parsing most of the time with GLR fallback. – Ira Baxter Jan 04 '13 at 06:16
  • @leppie: Generating a working set of grammar rules is arguably "one time", but it can take a LOT of iterations. Read the reference manual, adjust the rules, try them on real code, oops, real code does things the reference manual doesn't allow (but the real compilers do), guess the grammar rules for the real compiler, adjust the grammar, try again. The time you spend in this iteration is vastly dominated by re-running your 10,000 sample programs you are using to validate the grammar. (And you still don't get it right!). – Ira Baxter Jan 04 '13 at 06:26

1 Answer


The DMS system I developed processes a production IBM Enterprise COBOL front-end grammar daily, in about 7 seconds on a crummy laptop (measured just now on that laptop).

The grammar has about 500 terminals and 2500 productions, averaging about 2.5 tokens per production. Our productions are exactly as you describe them (no EBNF; it just doesn't buy enough to matter, and yes, we're big DSL fans, but sometimes the geegaws people put in a DSL aren't worth it). The parser generator produces 3800 states. (These values were measured just now, too.)

DMS has a full C++11 grammar with lots of extra stuff to handle GCC and MS dialects, as well as OpenMP. The grammar has 457 terminals and some 3000 productions, averaging about 2.3 tokens per production. The parser generator produces 5800 states. It takes somewhat longer to generate: 11 seconds, on an i7. What you might find surprising is that it takes some tens of seconds to generate the lexer (really multiple lexers); there's a lot more lexing weirdness in C++11 than you'd expect.

The generator is a GLR generator of our own implementation.

We didn't do a lot to optimize the generation time. It likely can be sped up by a factor of 10 or more; we don't do the sophisticated cycle-detection optimization suggested in most papers on LR parser generation. The consequence is that it takes longer to generate the tables, but nothing is lost in functionality. We've never had enough motivation to do that optimization, because there are so many other things to do with a language front end besides worrying about parser table generation time.

I doubt the data structures matter a lot, if designed reasonably. We don't worry much about sizes of rules, item sets or states; we just use dynamic arrays and they take care of themselves. We do pack lookaheads into dense bit vectors.
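
For concreteness, a dense lookahead bit vector of the kind described above might look roughly like this in C++; the type and member names and the 64-bit word size are illustrative assumptions, not DMS's actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One bit per terminal symbol; terminals are assumed to be numbered
// 0 .. num_terminals-1.
struct LookaheadSet {
    std::vector<std::uint64_t> words;

    explicit LookaheadSet(std::size_t num_terminals)
        : words((num_terminals + 63) / 64, 0) {}

    void add(std::size_t t) { words[t / 64] |= std::uint64_t(1) << (t % 64); }

    bool contains(std::size_t t) const { return (words[t / 64] >> (t % 64)) & 1; }

    // Union in another set of the same size; reports whether anything changed,
    // which is what you need when propagating lookaheads to a fixed point.
    bool merge(const LookaheadSet& other) {
        bool changed = false;
        for (std::size_t i = 0; i < words.size(); ++i) {
            std::uint64_t m = words[i] | other.words[i];
            if (m != words[i]) { words[i] = m; changed = true; }
        }
        return changed;
    }
};
```

With a few hundred terminals, each lookahead set is only a handful of machine words, so unions during lookahead propagation stay cheap and cache-friendly compared to node-based sets.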

As additional background data, you'll probably find this paper useful: Tiago Alves and Joost Visser, Metrication of SDF Grammars. Technical Report, DI-Research.PURe-05.05.01, Departamento de Informática, Universidade do Minho, May 2005.

The parser generator isn't where you have a difficult time with grammars; the hard part is getting the grammar rules right for the specific implementations.

Ira Baxter
  • Wow thanks for going through the trouble of measuring! And for the advice as well. This is an awesome answer, thanks so much! :) – user541686 Jan 04 '13 at 06:44
  • It's also reassuring to know that my ~4s is in the right ballpark! – user541686 Jan 04 '13 at 06:51
  • You said that was for a stripped down Python grammar, not a full production grammar. I suspect it is high, because you are generating LR(1) tables. There's a corresponding *time* cost to generating all those extra states. – Ira Baxter Jan 04 '13 at 06:52
  • Ah yeah, it's for a stripped down Python grammar, so it's still definitely slow all right -- what I meant when I said it's in the right ballpark is that it seems to be on the right track, not that it's already reached its goal. :) – user541686 Jan 04 '13 at 07:05
  • So I managed to get it down to ~2 seconds by putting all productions into a single contiguous array and representing each item core as an iterator into that array, and then using a bit vector for representing item sets in one place. The bottleneck is still in memory allocation (and some non-locality in the data). Using a hashtable (`unordered_map` instead of `map`) brings it down to ~1.4 seconds... and now I seem to be hitting a wall, unless I try to use a better allocator or something... the fact that I'm representing item sets as `set<...>` seems to be what's slowing it down. :\ – user541686 Jan 04 '13 at 12:03
  • Okay so I just gave it a _real_ grammar and it's been working for 5 minutes now, at 300 MB of memory and growing... >__> any tips on how to calculate closures efficiently? I'm trying to cache the results but it doesn't seem to help so much... – user541686 Jan 10 '13 at 04:31
  • Read anything by David Pager, but this paper on LR(1) parser generation will probably interest you; IIRC, this contains the state-cycle optimization we didn't do. http://dl.acm.org/citation.cfm?id=57669.57684. Adrian Johnstone writes lots of papers on parsing. Dick Grune writes pretty nice stuff; see http://dickgrune.com/Books/PTAPG_1st_Edition/, but he has a more recent book out. All likely to have good ideas. – Ira Baxter Jan 10 '13 at 04:48
  • Oh by the way, I'm basically using [this](http://stackoverflow.com/a/14252808/541686) procedure (except a C++ version) to generate the closures. – user541686 Jan 10 '13 at 08:23
  • I didn't pay much attention. A closure on the nucleus set to get the item set is pretty straightforward if you precompute follow in-context-sets (also a closure problem) for each item in a grammar rule. We use Warshall's algorithm to do this on the entire grammar at once; it goes pretty fast. .... – Ira Baxter Jan 10 '13 at 12:01
  • .... But Pager's key trick, that we didn't use, is to discover cycles in the states graph as it is being generated, and treat all the states in that cycle as a single entity for the purposes of lookahead set propagation, which is also a closure problem. This is where our cheesier version spends most of its time, IIRC. – Ira Baxter Jan 10 '13 at 12:02
  • ... but I'll reiterate my thought: I don't think doing LR(1) is a good idea. It buys only a little bit of additional discrimination power over LALR *in practice* if all you are doing is conventional L(AL)R parsing, and if you are backing up the "fast" L(AL)R parser with GLR where necessary, I think it buys you *nothing* in practice except huge state graphs and long generation times. – Ira Baxter Jan 10 '13 at 12:04
  • So I ended up making it LR(k) (or rather, LR(0) with k additional lookahead for states that require it). It *seems* correct as far as I can tell, and it's pretty efficient too (takes 5 seconds for a 1430-state machine with the bound k = 2 [for this C# grammar](http://slps.github.io/zoo/cs/csharp-msft-ls-4.0.dms)). I can't tell for certain if it's correct though; it's quite likely I'm missing some edge cases... but I'm not sure what the best way to look for edge cases is. If you have any ideas I'd appreciate them! – user541686 Nov 27 '13 at 10:48
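
Since the later comments ask about computing closures efficiently, here is a compact C++ sketch of the textbook LR(1) closure, for reference only. It is not the DMS implementation and not the procedure from the linked answer; the symbol encoding and names are illustrative assumptions. FIRST/nullable are precomputed once up front (here by naive fixed-point iteration; a Warshall-style transitive closure, as mentioned in the comments, is another way to get the same information), and the closure itself is then a plain worklist loop.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Illustrative encoding: symbol ids 0 .. num_terminals-1 are terminals,
// the remaining ids up to num_symbols-1 are nonterminals.
struct Production {
    int lhs;                // a nonterminal
    std::vector<int> rhs;   // any mix of terminals and nonterminals
};

struct Grammar {
    std::size_t num_terminals;
    std::size_t num_symbols;
    std::vector<Production> prods;
    bool is_terminal(int s) const { return s < (int)num_terminals; }
};

// An LR(1) item: production index, dot position, and one lookahead terminal.
struct Item {
    std::size_t prod, dot;
    int lookahead;
    bool operator<(const Item& o) const {
        if (prod != o.prod) return prod < o.prod;
        if (dot != o.dot) return dot < o.dot;
        return lookahead < o.lookahead;
    }
};

// FIRST sets and nullability for every symbol, by fixed-point iteration.
// A terminal's FIRST set is just itself; terminals are never nullable.
void compute_first(const Grammar& g,
                   std::vector<std::set<int>>& first,
                   std::vector<bool>& nullable) {
    first.assign(g.num_symbols, std::set<int>{});
    nullable.assign(g.num_symbols, false);
    for (std::size_t t = 0; t < g.num_terminals; ++t) first[t].insert((int)t);

    for (bool changed = true; changed; ) {
        changed = false;
        for (const Production& p : g.prods) {
            std::size_t i = 0;
            for (; i < p.rhs.size(); ++i) {
                for (int t : first[p.rhs[i]])
                    changed |= first[p.lhs].insert(t).second;
                if (!nullable[p.rhs[i]]) break;
            }
            if (i == p.rhs.size() && !nullable[p.lhs]) {
                nullable[p.lhs] = true;
                changed = true;
            }
        }
    }
}

// Closure: for each item [A -> alpha . B beta, a] with B a nonterminal, and each
// production B -> gamma, add [B -> . gamma, b] for every b in FIRST(beta a).
std::set<Item> closure(const Grammar& g,
                       const std::vector<std::set<int>>& first,
                       const std::vector<bool>& nullable,
                       std::set<Item> items) {
    std::vector<Item> work(items.begin(), items.end());
    while (!work.empty()) {
        Item it = work.back();
        work.pop_back();
        const Production& p = g.prods[it.prod];
        if (it.dot >= p.rhs.size() || g.is_terminal(p.rhs[it.dot])) continue;
        int B = p.rhs[it.dot];

        // FIRST(beta a): symbols after B, falling through to the item's own
        // lookahead if everything after B can derive the empty string.
        std::set<int> la;
        std::size_t i = it.dot + 1;
        for (; i < p.rhs.size(); ++i) {
            la.insert(first[p.rhs[i]].begin(), first[p.rhs[i]].end());
            if (!nullable[p.rhs[i]]) break;
        }
        if (i == p.rhs.size()) la.insert(it.lookahead);

        for (std::size_t pi = 0; pi < g.prods.size(); ++pi) {
            if (g.prods[pi].lhs != B) continue;
            for (int b : la) {
                Item ni{pi, 0, b};
                if (items.insert(ni).second) work.push_back(ni);
            }
        }
    }
    return items;
}
```

Grouping productions by left-hand side and caching FIRST(beta a) per (production, dot position) removes most of the redundant work in this loop; the Pager-style cycle trick discussed above concerns lookahead propagation across states, not this per-state closure.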