
I have a huge grammar developed for pyparsing as part of a large, pure Python application. I have reached the limit of performance tweaking and I'm at the point where the diminishing returns make me start to look elsewhere. Yes, I think I know most of the tips and tricks and I've profiled my grammar and my application to dust.

What next?

I hope to find a parser that gives me the same readability, usability (I'm using many advanced features of pyparsing such as parse-actions to start the post processing of the input which is being parsed) and python integration but at 10× the performance.

I love the fact the the grammar is pure Python.

All my basic blocks are regular expressions, so reusing them would be nice.

I know I can't have everything so I am willing to give up on some of the features I have today to get to the requested 10× performance.

Where do I go from here?
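For reference, here is a stripped-down sketch of the kind of thing my grammar does (the rules and names are invented purely for illustration): regex terminals combined into pyparsing expressions, with a parse action kicking off post-processing during the parse.

    # Illustrative only: a toy stand-in for the real grammar.
    # Regex terminals, pyparsing combinators, and a parse action that
    # starts post-processing while the input is being parsed.
    from pyparsing import Regex, Suppress, delimitedList

    integer = Regex(r"[+-]?\d+").setParseAction(lambda t: int(t[0]))
    ident = Regex(r"[A-Za-z_]\w*")

    assignment = ident("lhs") + Suppress("=") + delimitedList(integer)("nums")

    def start_postprocessing(tokens):
        # stand-in for the real parse actions
        return {tokens["lhs"]: list(tokens["nums"])}

    assignment.setParseAction(start_postprocessing)

    print(assignment.parseString("x = 1, 2, 3"))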

– Tal Weiss

5 Answers

9

It looks like the pyparsing folks have anticipated your problem. From https://github.com/pyparsing/pyparsing/blob/master/docs/HowToUsePyparsing.rst :

Performance of pyparsing may be slow for complex grammars and/or large input strings. The psyco package can be used to improve the speed of the pyparsing module with no changes to grammar or program logic - observed improvements have been in the 20-50% range.

However, as Vangel noted in the comments below, psyco is an obsolete project as of March 2012. Its successor is the PyPy project, which starts from the same basic approach to performance: use a JIT native-code compiler instead of a bytecode interpreter. You should be able to achieve similar or greater gains with PyPy if switching Python implementations will work for you.
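If you want a quick read on whether the switch pays off for your particular grammar, a tiny harness like the one below can be run unchanged under both interpreters; the module and file names are placeholders for your own code.

    # Time one parse so CPython and PyPy runs can be compared directly.
    # "my_grammar" and "sample.txt" are placeholders, not real names.
    import time
    from my_grammar import parser  # hypothetical module exposing a pyparsing grammar

    with open("sample.txt") as f:
        data = f.read()

    start = time.time()
    result = parser.parseString(data)
    print("parse took %.2f seconds" % (time.time() - start))

Save it as, say, bench.py, run it once as python bench.py and once as pypy bench.py, and compare the two numbers.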

If you're really a speed demon, but want to keep some of the legibility and declarative syntax, I'd suggest having a look at ANTLR. Probably not the Python-generating backend; I'm skeptical whether that's mature or high-performance enough for your needs. I'm talking about the goods: the C backend that started it all.

Wrap a Python C extension module around the entry point to the parser, and turn it loose.

Having said that, you'll be giving up a lot in this transition: basically any Python you want to do in your parser will have to be done through the C API (not altogether pretty). Also, you'll have to get used to very different ways of doing things. ANTLR has its charms, but it's not based on combinators, so there's not the easy and fluid relationship between your grammar and your language that there is in pyparsing. Plus, it's its own DSL, much like lex/yacc, which can present a learning curve – but, because it's LL based, you'll probably find it easier to adapt to your needs.
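One relatively low-friction way to do that wrapping, shown here with ctypes rather than a hand-written extension module just to keep the sketch short: assume the ANTLR-generated C parser has been compiled into a shared library with a single entry point that takes the input text and hands back a serialized result. Every name below (library, function, signature) is hypothetical.

    # Hypothetical sketch, not the real ANTLR C API.
    import ctypes

    lib = ctypes.CDLL("./libmyparser.so")        # compiled ANTLR C parser (hypothetical)
    lib.parse_input.argtypes = [ctypes.c_char_p]
    lib.parse_input.restype = ctypes.c_char_p

    def parse(text):
        raw = lib.parse_input(text.encode("utf-8"))
        # e.g. a serialized parse tree, post-processed back in Python
        return raw.decode("utf-8")

    print(parse("x = 1, 2, 3"))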

– Owen S.
  • psyco is dead and no longer maintained. However, I found PyPy and am giving it a try. Any 2013 answers? – Abhishek Dujari May 08 '13 at 00:26
  • Yes, I did look into Cython, but I can't wrap my head around how Cython works with Python programs here. Apparently there is ready-made stuff for pyparsing and Cython, but it's in the 2.x pyparsing branch for Python 3.0, or I am totally confused now. – Abhishek Dujari May 15 '13 at 17:00
  • I tried PyPy. Yes, it does work and it is fantastic. It cut my pyparsing execution from 250 seconds to about 48 seconds. That is huge. I was using the same sample data. I will now try Cython, but from what I keep reading, PyPy does beat Cython. (Note: I was using the Windows binary of PyPy, which is only 32-bit.) – Abhishek Dujari May 16 '13 at 23:31
  • Thanks for the report, Vangel! I've updated with PyPy, and if Cython works just report back here. – Owen S. May 17 '13 at 21:39
  • For me, running with PyPy without any modifications turned my 6h script into a 1h one. – OriolJ Aug 21 '13 at 09:34
  • Pyparsing is no longer hosted on wikispaces.com. Go to https://github.com/pyparsing/pyparsing – PaulMcG Aug 27 '18 at 13:19
  • Thanks, Paul! Updated the link. – Owen S. Sep 04 '18 at 19:08
2

Switch to a generated C/C++ parser (using ANTLR, flex/bison, etc.). If you can delay all the action rules until after you are done parsing, you might be able to build an AST with trivial code and then pass that back to your Python code via something like SWIG and process it with your current action rules. OTOH, for that to give you a speed boost, the parsing has to be the heavy lifting. If your action rules are the big cost, then this will buy you nothing unless you write your action rules in C as well (but you might have to do that anyway to avoid paying for whatever impedance mismatch you get between the Python and C code).
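As a rough picture of what the Python side could look like once the parse itself lives in C/C++ (every name here is hypothetical; "fastparser" stands for whatever SWIG-wrapped module you end up generating):

    # Hypothetical: a SWIG-wrapped C/C++ parser that only builds an AST,
    # with the deferred action rules applied afterwards in Python.
    import fastparser  # SWIG-generated module (does not exist; for illustration)

    def handle_assignment(node):
        print("assignment:", node.text)

    def apply_actions(node):
        # the deferred "action rules": walk the AST after parsing is done
        for child in node.children:
            apply_actions(child)
        if node.kind == "assignment":
            handle_assignment(node)

    with open("input.txt") as f:
        ast = fastparser.parse(f.read())
    apply_actions(ast)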

– BCS
2

If you really want performance for large grammars, look no further than SimpleParse (which itself relies on mxTextTools, a C extension). However, be aware that it comes at the cost of being more cryptic and requiring that you be well-versed in EBNF.

It's definitely not the more Pythonic route, and you're going to have to start all over with an EBNF grammar to use SimpleParse.
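For a taste of what that looks like, here is a minimal sketch (assuming SimpleParse 2.x and its simpleparse.parser.Parser API): the grammar is a block of EBNF text rather than Python objects.

    # Minimal SimpleParse sketch: a comma-separated list of integers.
    from simpleparse.parser import Parser

    declaration = r'''
    ints := int, (ts, ',', ts, int)*
    int  := [0-9]+
    ts   := [ \t]*
    '''

    parser = Parser(declaration, "ints")
    success, children, nextchar = parser.parse("1, 2, 3")
    print(success, children, nextchar)  # result tags with start/stop offsets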

– jathanism
2

A bit late to the party, but PLY (Python Lex-Yacc) has served me very well. PLY gives you a pure Python framework for constructing lex-based tokenizers and yacc-based LR parsers.

I went this route when I hit performance issues with pyparsing.

Here is a somewhat old but still interesting article on Python parsing which includes benchmarks for ANTLR, PLY and pyparsing. PLY is roughly 4 times faster than pyparsing in this test.
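To give a feel for the shape of PLY code, here is a minimal lexer/parser pair for a comma-separated list of integers; a real grammar is obviously much larger, but the structure stays the same.

    # Minimal PLY sketch: tokenizer rules plus yacc-style grammar rules.
    import ply.lex as lex
    import ply.yacc as yacc

    tokens = ("NUMBER", "COMMA")

    t_COMMA = r","
    t_ignore = " \t"

    def t_NUMBER(t):
        r"\d+"
        t.value = int(t.value)
        return t

    def t_error(t):
        raise SyntaxError("illegal character %r" % t.value[0])

    def p_list_many(p):
        "list : list COMMA NUMBER"
        p[0] = p[1] + [p[3]]

    def p_list_one(p):
        "list : NUMBER"
        p[0] = [p[1]]

    def p_error(p):
        raise SyntaxError("syntax error at %r" % (p,))

    lexer = lex.lex()
    parser = yacc.yacc()

    print(parser.parse("1, 2, 3"))  # -> [1, 2, 3]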

– bohrax
1

There's no way to know what kind of benefit you'll get without just testing it, but it's within the range of possibility that you could get 10x benefit just from using Unladen Swallow if your process is long-running and repetitive. (Also, if you have many things to parse and you typically start a new interpreter for each one, Unladen Swallow gets faster - to a point - the longer you run your process, so while parsing one input might not show much gain, you might get significant gains on the 2nd and 3rd inputs in the same process).

(Note: pull the latest out of SVN - you'll get far better performance than the latest tarball)
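The "long-running process" point looks roughly like this in practice: feed every input through one interpreter so the JIT has time to warm up, rather than launching a fresh interpreter per file. The module name below is a placeholder for your own grammar.

    # Keep one process alive and push all inputs through it.
    import glob
    from my_grammar import parser  # hypothetical module exposing a pyparsing grammar

    for path in glob.glob("inputs/*.txt"):
        with open(path) as f:
            result = parser.parseString(f.read())
        # later files should parse noticeably faster than the first,
        # once the JIT has compiled the hot paths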

– Nick Bastin
  • Nick - I started reading about US, and while I'm installing, compiling and building I came across the Pycon2010 presentation benchmarks. I don't see any benchmark with even a 2X gain over CPython 2.6.4! Why do you expect anything better? That said, this is the easiest option so I might as well just try it... – Tal Weiss Jul 03 '10 at 22:20
  • @Tal: My own personal experience, really (able to get a 3.5-4x speed improvement on some parsing code). The US benchmarks are real-world benchmarks, which is useful, but they miss the benefits of retooling your code to benefit more from US - specifically by creating a long-running process instead of a bunch of short-lived processes to do your work. When I did my parsing tests, the difference in parsing one file was marginal - maybe 5-10% faster - but by the time the 10th file was fed through the same parser process, it was operating almost 400% faster. – Nick Bastin Jul 04 '10 at 01:13