8

I'm trying to create some kind of lint tool for the C/AL programming language. So basically I need to perform lexical and syntax analysis on the source code. I had planned to write a parser from scratch, but then discovered that there are lots of tools that help generate these parsers automatically.

I need performance, since checking 20 megabytes of code in one piece is the normal scenario, and I need the tool to be extensible with custom rules. So I decided to go with JavaScript.

So far I have found two generators I could use: Jison and PEG.js.

Which of them gives me more parsing performance? Or perhaps the comparison should be between the underlying algorithms rather than the libraries?

Which one is better for my needs (parsing general purpose programming language)?

UPDATE: I have found similar Q&As:

shytikov

3 Answers

9

"I need performance (for 20 MB) ... so I decided JavaScript". These are contradictory.

Carefully coded recursive descent parsers can be pretty fast, but you have to write a lot of code. Generally, LALR(1) parsers (generated by Bison from a grammar etc.) are quite fast, and pretty easy to code. (There are technical papers showing how to compile LALR parsers directly to machine code; these parsers are blazingly fast but you need to implement a lot of custom machinery to build one).
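To make the hand-coded option concrete, here is a sketch of what a carefully coded recursive descent parser looks like, for a toy arithmetic grammar rather than C/AL; each nonterminal becomes one function, and precedence is encoded in the call structure:

```javascript
// Minimal recursive descent parser/evaluator for a toy expression
// grammar (numbers, + - * /, parentheses). Illustrative only.
function parseExpr(input) {
  let pos = 0;

  function peek() { return input[pos]; }
  function eat(ch) {
    if (input[pos] !== ch) throw new Error("expected " + ch + " at " + pos);
    pos++;
  }

  // expr := term (('+' | '-') term)*
  function expr() {
    let value = term();
    while (peek() === "+" || peek() === "-") {
      const op = input[pos++];
      const rhs = term();
      value = op === "+" ? value + rhs : value - rhs;
    }
    return value;
  }

  // term := factor (('*' | '/') factor)*
  function term() {
    let value = factor();
    while (peek() === "*" || peek() === "/") {
      const op = input[pos++];
      const rhs = factor();
      value = op === "*" ? value * rhs : value / rhs;
    }
    return value;
  }

  // factor := number | '(' expr ')'
  function factor() {
    if (peek() === "(") {
      eat("(");
      const value = expr();
      eat(")");
      return value;
    }
    const start = pos;
    while (/[0-9]/.test(input[pos])) pos++;
    if (pos === start) throw new Error("expected number at " + pos);
    return Number(input.slice(start, pos));
  }

  const result = expr();
  if (pos !== input.length) throw new Error("trailing input at " + pos);
  return result;
}

console.log(parseExpr("2+3*4"));   // → 14
console.log(parseExpr("(2+3)*4")); // → 20
```

A real language needs one such function per grammar rule, plus a separate lexer and error recovery, which is where the "a lot of code" comes from.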

If you want flat-out high performance parsing with minimal sweat, you should consider LRStar. (I know and highly respect the author but otherwise have nothing to do with it). This produces very fast LALR parsers. Downside: you have to make your grammar LALR compatible. You will have to extend your "rules" the same way you extend any other C program: by writing C code. That doesn't seem a lot worse than writing JavaScript IMHO, but the rules will likely execute a lot faster, and at the scale you are contemplating this will matter.

GLR parsing is necessarily slower than LALR because it has more bookkeeping to do. That's just a constant factor, but it can be significant (e.g., 100x) compared to a high performance engine like LRStar. It may be worth the trouble, because it is much easier to pound your grammar into shape, and a less convoluted grammar will likely make writing custom rules easier to do. If you really have millions of lines of code, these parsers will likely be medium speed at best.

PEG is basically backtracking. A PEG parser has to try things and backtrack when they don't work, so it has to be slower than an LALR parser by at least the amount of backtracking it does. You also have the grammar shaping problem.
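The backtracking can be illustrated with a toy sketch of PEG-style ordered choice in JavaScript (hypothetical combinators, not PEG.js's actual API): each alternative is tried from the same position, and work done on a failed alternative is thrown away.

```javascript
// Toy PEG-style matchers. A matcher takes (input, pos) and returns
// the new position on success or -1 on failure.
function lit(s) {
  return (input, pos) => (input.startsWith(s, pos) ? pos + s.length : -1);
}

// Ordered choice: try each alternative from the same position,
// backtracking (discarding consumed input) when one fails.
function choice(...alts) {
  return (input, pos) => {
    for (const alt of alts) {
      const next = alt(input, pos);
      if (next !== -1) return next; // first success wins
    }
    return -1; // every alternative failed here
  };
}

// Sequence: run each part, threading the position through.
function seq(...parts) {
  return (input, pos) => {
    for (const part of parts) {
      pos = part(input, pos);
      if (pos === -1) return -1;
    }
    return pos;
  };
}

// keyword := "BEGIN" / "BEG"
// On input "BEG", the matcher consumes toward "BEGIN", fails,
// and retries "BEG" from position 0 — that retry is the backtracking cost.
const keyword = choice(lit("BEGIN"), lit("BEG"));
console.log(keyword("BEGIN", 0)); // → 5
console.log(keyword("BEG", 0));   // → 3
```

Packrat memoization can cap that cost at linear time, but only by spending memory proportional to the input.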

What you will discover, though, is that parsing simply isn't enough if you want to do anything the faintest bit sophisticated. In that case, you don't want to optimize on parsing; you want to optimize on infrastructure for program analysis. See my essay on Life After Parsing for another alternative.

Ira Baxter
  • Thank you! I've mentioned that I want to have custom rules in `lint`, so I definitely need a scripting language, and JavaScript (the V8 node.js implementation) seems to be faster than Python, Ruby... – shytikov Aug 01 '12 at 14:44
  • I don't think you understood my point on "Life After Parsing". You seem to have fixed your choice of a rules language on JavaScript, where you basically have zero infrastructure to support your analysis tasks. So you get the advantage of a widely available language in which to code your rules, and no support for those rules to draw on. I'm pessimistic about your actual success, unless your rules are basically trivial.... then what's the point? – Ira Baxter Aug 01 '12 at 16:30
  • I'm reading "Life After Parsing". It's long and I'm impatient. Sorry. I'm looking at JS because there are some `lint` tools written in JavaScript with source code available. I hope they will help me. But you're right. The task is huge... – shytikov Aug 02 '12 at 06:31
  • what is "MSLOC"? – Paweł Badeński Mar 23 '21 at 10:54
  • == "Million(s) of Source Lines of Code". I revised the answer to make this clearer. – Ira Baxter Mar 23 '21 at 13:37
7

In general you'd get very good parsing performance from a shift-reduce parser such as the one that Jison generates. It's a bit old-school perhaps, but it runs in linear time within very tight memory requirements.
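The shift/reduce idea can be sketched in a few lines; this is a hand-written operator-precedence evaluator, not Jison's generated LALR tables, but it shows the two moves every such parser makes: shift a token onto a stack, or reduce the stack top by a rule.

```javascript
// Minimal operator-precedence shift-reduce evaluator for '+' and '*'
// over a pre-tokenized input. Sketch of the shift/reduce mechanism only.
const PREC = { "+": 1, "*": 2 };

function evalTokens(tokens) {
  const values = []; // value stack
  const ops = [];    // operator stack

  function reduce() {
    const op = ops.pop();
    const b = values.pop();
    const a = values.pop();
    values.push(op === "+" ? a + b : a * b);
  }

  for (const tok of tokens) {
    if (typeof tok === "number") {
      values.push(tok); // shift a value
    } else {
      // reduce while the stacked operator binds at least as tightly
      while (ops.length && PREC[ops[ops.length - 1]] >= PREC[tok]) reduce();
      ops.push(tok);    // shift the operator
    }
  }
  while (ops.length) reduce();
  return values.pop();
}

console.log(evalTokens([2, "+", 3, "*", 4])); // → 14
```

Note the linear-time, small-stack behavior the answer describes: each token is shifted once and reduced at most once, and the only state is the two stacks.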

PEG produces a different kind of parser that is perhaps more capable, but requires more memory to produce the same result. I.e., a PEG parser will require an amount of memory proportional to the input, while an LALR parser will do it in much less space (some tables and a small stack).

Dervall
  • If the restrictions that an LR parser (like GLR or LALR; GLR only uses a bit more memory, and is a bit slower) places on your grammar are acceptable, then it is likely that LR will be the faster parser. The parser takes longer to generate, though, since it needs to compute its tables. But in the general case LR parsers are very efficient machines. – Dervall Aug 01 '12 at 12:34
  • Nobody but the guy using the parser generator cares how long the parser generator takes to run. Even he doesn't care much; big parsers are generated in seconds, at least for LALR parser generators and most of the others that I know about. – Ira Baxter Mar 12 '14 at 21:41
3

As far I have found two generators I can use Jison and PEG.js. Which of them give me more parsing performance?

According to the JavaScript Parser Libraries benchmark I have created, it seems that PEG.js is faster (at least on Chrome/V8).

You can run it online here: http://sap.github.io/chevrotain/performance/

Note that this benchmark uses a very simple JSON grammar to compare the performance of the parsing libraries, not a larger and more complex programming language grammar.
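If you want to measure on your own grammar instead, a rough Node.js timing harness might look like the following sketch; `parse` is a placeholder for whichever generated parser you are comparing, and the throughput figure is only approximate (it treats characters as bytes):

```javascript
// Rough throughput harness for comparing parsers on the same input.
// `parse` is any function taking the source text as a string.
function benchmark(parse, input, iterations = 50) {
  // Warm up so V8 has JIT-compiled the hot path before measuring.
  for (let i = 0; i < 5; i++) parse(input);

  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) parse(input);
  const ns = Number(process.hrtime.bigint() - start);

  // Approximate MB/s (characters ~ bytes for ASCII source).
  return (input.length * iterations) / (ns / 1e9) / 1e6;
}

// Example with a trivial "parser" that just scans the input:
const scan = s => { let n = 0; for (const c of s) if (c === ",") n++; return n; };
console.log(benchmark(scan, "a,b,c\n".repeat(10000)).toFixed(1) + " MB/s");
```

Running each candidate parser through the same harness on a representative chunk of your real source is a better guide than a JSON micro-benchmark for a 20 MB workload.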

bd82