39

Which Python tool can you recommend to parse programming languages? It should allow for a readable representation of the language grammar inside the source, and it should be able to scale to complicated languages (something with a grammar as complex as e.g. Python itself).

When I search, I mostly find pyparsing, which I will be evaluating, but of course I'm interested in other alternatives.

Edit: Bonus points if it comes with good error reporting and source code locations attached to syntax tree elements.

Stefan Majewsky
  • 5,427
  • 2
  • 28
  • 51
  • I think you will find the problem of defining the programming languages to be the hard part of your task. (If you want to parse Python, I'm sure you can get that off the shelf in Python). Parsing Java >=1.5 will be harder. Parsing C++ will be very difficult; wait till you get to C++11x. And you can't do much unless you do name and type resolution ("build symbol tables") after your parse. There's a lot more work here than you might guess. If your task is manipulating programming languages, you might consider a tool that can do this already, rather than trying to roll your own. – Ira Baxter Jul 06 '11 at 17:37
  • Similar to: http://stackoverflow.com/questions/2945357/python-how-best-to-parse-a-simple-grammar and: http://stackoverflow.com/questions/1547782/mini-languages-in-python – 0 _ Jul 24 '13 at 16:58

9 Answers

37

I really like pyPEG. Its error reporting isn't very friendly, but it can add source code locations to the AST.

pyPEG doesn't have a separate lexer, which would make parsing Python itself hard (I think CPython recognises indent and dedent in the lexer), but I've used pyPEG to build a parser for a subset of C# with surprisingly little work.

An example adapted from fdik.org/pyPEG/: A simple language like this:

function fak(n) {
    if (n==0) { // 0! is 1 by definition
        return 1;
    } else {
        return n * fak(n - 1);
    };
}

A pyPEG parser for that language:

def comment():          return [re.compile(r"//.*"),
                                re.compile(r"/\*.*?\*/", re.S)]
def literal():          return re.compile(r'\d*\.\d*|\d+|".*?"')
def symbol():           return re.compile(r"\w+")
def operator():         return re.compile(r"\+|\-|\*|\/|\=\=")
def operation():        return symbol, operator, [literal, functioncall]
def expression():       return [literal, operation, functioncall]
def expressionlist():   return expression, -1, (",", expression)
def returnstatement():  return keyword("return"), expression
def ifstatement():      return (keyword("if"), "(", expression, ")", block,
                                keyword("else"), block)
def statement():        return [ifstatement, returnstatement], ";"
def block():            return "{", -2, statement, "}"
def parameterlist():    return "(", symbol, -1, (",", symbol), ")"
def functioncall():     return symbol, "(", expressionlist, ")"
def function():         return keyword("function"), symbol, parameterlist, block
def simpleLanguage():   return function
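
To actually run a grammar like this with pyPEG 1, the invocation looks roughly like the following. This is an untested sketch based on the pyPEG 1 documentation (fdik.org/pyPEG1); as noted in the comments below, pyPEG 2 uses a different API.

# the grammar functions above additionally need:
#   import re
#   from pyPEG import keyword
import fileinput
from pyPEG import parse

# parse the file(s) named on the command line, skipping whitespace and
# anything matched by the comment() rule defined above
result = parse(simpleLanguage(), fileinput.input(), True, comment)
print(result)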
djromero
  • 19,551
  • 4
  • 71
  • 68
Will Harris
  • 21,597
  • 12
  • 64
  • 64
  • Looks very nice. As I'll use Python in the initial design phase only, when performance is not a top requirement, this seems to be it. – Stefan Majewsky Jul 08 '11 at 14:48
  • 2
    I'm trying to find out how to run this example. I assume it should be something like `from __future__ import unicode_literals, print_function ; from pypeg2 import * ; f = parse(example_string, simpleLanguage)`, provided that you load the above example as example_string. But that doesn't work. Also, the syntax is very different from the (current) original example on the pyPEG website. Any suggestions how to run the same code? – user989762 Dec 06 '13 at 05:45
  • @user989762 The link points at version 2 of pyPEG, which indeed uses a different syntax and conceptual model. Here is the version 1: http://fdik.org/pyPEG1/ – Feuermurmel Jan 16 '14 at 15:33
18

I would recommend that you check out my library: https://github.com/erezsh/lark

It can parse ALL context-free grammars, automatically builds an AST (with line & column numbers), and accepts the grammar in EBNF format, which is considered the standard.

It can easily parse a language like Python, and it can do so faster than any other parsing library written in Python.
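
To give a feel for the API, here is a small illustrative sketch (the grammar is mine, not from the answer); propagate_positions is what attaches line and column numbers to the tree nodes:

from lark import Lark

# a tiny EBNF-style grammar for "name = number" assignments
grammar = r"""
    start: assignment+
    assignment: NAME "=" NUMBER

    %import common.CNAME -> NAME
    %import common.INT -> NUMBER
    %import common.WS
    %ignore WS
"""

parser = Lark(grammar, propagate_positions=True)
tree = parser.parse("x = 1\ny = 2\n")
print(tree.pretty())
for node in tree.find_data("assignment"):
    print(node.meta.line, node.meta.column)  # source location of each assignment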

Erez
  • 1,287
  • 12
  • 18
  • 1
    Thanks for the well-documented library. Only I couldn't find a Python parser; it said there is one in the examples. – serg Mar 30 '17 at 04:01
  • @serg I'm glad you like it! I refrained from adding the python grammars due to perfectionist reasons, but it's there now (in the [examples](https://github.com/erezsh/lark/tree/master/examples)). If you want to do real work with the resulting AST, you'll still have to fine-tune it a bit (or wait until I do). – Erez Apr 05 '17 at 16:02
  • I love the library... It's just that the grammar reference doesn't give very many real-life examples... Still, an amazing library. :D – xilpex Apr 10 '19 at 05:49
9

pyPEG (a tool I authored) has a tracing facility for error reporting.

Just set pyPEG.print_trace = True and pyPEG will give you a full trace of what's happening inside.
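
In code, that amounts to something like this (sketch, assuming pyPEG 1 as described above):

import pyPEG

pyPEG.print_trace = True  # trace every match attempt during parsing
# ...then call parse() as usual and follow the trace to the failing rule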

0 _
  • 10,524
  • 11
  • 77
  • 109
  • 2
    The tone of your answer is that of a conversation. Please use the comments section under the question to discuss the problem. Answers are answers to the post and should be informative, straightforward and not require response. – Michael Allen Dec 10 '15 at 22:46
5

For a more complicated parser I would use pyparsing.

Here is the example from their home page:

from pyparsing import Word, alphas

greet = Word(alphas) + "," + Word(alphas) + "!"  # <-- grammar defined here

hello = "Hello, World!"
print(hello, "->", greet.parseString(hello))
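
On the question's bonus points: when a pyparsing grammar fails to match, it raises a ParseException carrying the offending line and position. An illustrative sketch (mine, not from the original answer):

from pyparsing import Word, alphas, ParseException

greet = Word(alphas) + "," + Word(alphas) + "!"

try:
    greet.parseString("Hello World!")  # missing comma
except ParseException as err:
    print(err.line)                   # text of the failing line
    print(" " * (err.col - 1) + "^")  # caret under the failure position
    print(err)                        # message with line and column info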
0 _
  • 10,524
  • 11
  • 77
  • 109
Jakob Bowyer
  • 33,878
  • 8
  • 76
  • 91
  • The OP already knows about it and said he is going to evaluate it. Your point? (I am not the downvoter, BTW) – mac Jul 04 '11 at 13:18
  • 3
    Pyparsing does belong in the answers, though, to allow up/down votes. It probably would have been better for the OP to include it as an answer right from the start. – Michael J. Barber Jul 07 '11 at 14:55
  • I have been trying to use pyparsing and am finding it to be too restricting. – John Glen Apr 05 '22 at 23:51
4

Antlr is what you should look at: http://www.antlr.org

Take a look at the Python target documentation: http://www.antlr.org/wiki/display/ANTLR3/Antlr3PythonTarget
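
For a rough idea of what the Python side looks like, here is a sketch assuming ANTLR 4 with the Python 3 target (see the comments below) and a hypothetical grammar MyGrammar with a start rule named prog; MyGrammarLexer and MyGrammarParser are generated by the ANTLR tool, not written by hand:

# generated beforehand with: antlr4 -Dlanguage=Python3 MyGrammar.g4
from antlr4 import InputStream, CommonTokenStream
from MyGrammarLexer import MyGrammarLexer
from MyGrammarParser import MyGrammarParser

stream = InputStream("some input in your language")
lexer = MyGrammarLexer(stream)
tokens = CommonTokenStream(lexer)
parser = MyGrammarParser(tokens)

tree = parser.prog()                    # 'prog' is the assumed start rule
print(tree.toStringTree(recog=parser))  # LISP-style view of the parse tree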

Ankur Gupta
  • 2,284
  • 4
  • 27
  • 40
  • 8
    Looks nice, but I have a personal hatred for anything Java. (Partly because it's an awful language to me, and because installations never worked correctly for me.) – Stefan Majewsky Jul 04 '11 at 13:16
  • 3
    Only the code generator tool is Java. You can generate code for several languages, so you don't need to write or deploy any Java code to use ANTLR. – John Zwinck Jul 04 '11 at 13:26
  • 1
    You need not know Java to use ANTLR. Try it out; they have a free book on the site as well. Once you are through chapter 4, you can simply use the ANTLR Python parser grammar code from the website. – Ankur Gupta Jul 04 '11 at 13:31
  • @Ankur Gupta, what free book? – Bart Kiers Jul 04 '11 at 14:01
  • 1
    @Stefan: Of all the software I've ever installed, the Java runtimes and development kits have given me the least (close to zero) trouble, and I've been doing this since Java first came out. (No, I'm not a Java lover, just a Java user). – Ira Baxter Jul 06 '11 at 16:04
  • The nice thing about antlr, besides the IDE, is that the algorithm it uses for parsing is LL(k), which, after taking compilers and using bison, is quite refreshing and easy to use. – fuzzy-waffle Jul 06 '11 at 16:22
  • that link is broken – fferri Jun 23 '16 at 21:07
  • @AnkurGupta You can use the ANTLRWorks IDE for generating lexers and grammars and testing them. You can also use the antlr4 -Dlanguage=Python3 MyGrammar.g4 command to generate Python code for your lexer and grammar. – Reza Akraminejad Apr 19 '17 at 06:55
3

Ned Batchelder did a survey of python parsing tools, which apparently he keeps updated (last updated July 2010):

http://nedbatchelder.com/text/python-parsers.html

If I were going to need a parser today, I would either roll my own recursive descent parser, or possibly use PLY or LEPL -- depending on my needs and whether or not I was willing to introduce an external dependency. I wouldn't personally use PyParsing for anything very complicated.
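
For comparison, PLY defines the lexer and the grammar through functions whose docstrings hold the rules. A minimal illustrative calculator sketch (my own example, not from the survey):

import ply.lex as lex
import ply.yacc as yacc

# --- lexer ---
tokens = ("NUMBER", "PLUS", "TIMES")

t_PLUS = r"\+"
t_TIMES = r"\*"
t_ignore = " \t"

def t_NUMBER(t):
    r"\d+"
    t.value = int(t.value)
    return t

def t_error(t):
    print("Illegal character %r" % t.value[0])
    t.lexer.skip(1)

# --- parser: docstrings are the grammar rules ---
def p_expr_plus(p):
    "expr : expr PLUS term"
    p[0] = p[1] + p[3]

def p_expr_term(p):
    "expr : term"
    p[0] = p[1]

def p_term_times(p):
    "term : term TIMES NUMBER"
    p[0] = p[1] * p[3]

def p_term_number(p):
    "term : NUMBER"
    p[0] = p[1]

def p_error(p):
    print("Syntax error at %r" % (p.value if p else "end of input"))

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse("2 + 3 * 4"))  # -> 14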

Matt Anderson
  • 19,311
  • 11
  • 41
  • 57
  • This is a bit of an outdated comment, but just for the record: I would, and have, used pyparsing to implement something fairly complicated -- for maintainability, it's fantastic. It's one of the best pure-Python parsing tools in existence. I forget which famous programmer it was who said, "If you're thinking of rolling your own recursive descent parser, don't". Just my two cents. – Gilead Jul 20 '12 at 05:56
  • 3
    `PLY` is the preferred solution after possibly prototyping with a lighter and higher level approach like `PyParsing`. The documentation of `PLY` is really good, error reporting is good, and it has a robust plain structure of defining the grammar. Even with `packrat` enabled, `PyParsing` can be orders of magnitude slower than `PLY`. – 0 _ Apr 09 '14 at 09:20
2

For simple tasks I tend to use the shlex module.

See http://wiki.python.org/moin/LanguageParsing for an evaluation of language parsing tools in Python.
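
A quick illustration of shlex from the standard library; it handles shell-style tokenization (quoting, comments), which is often enough for small command-like languages:

import shlex

print(shlex.split('load "my file.txt" --verbose'))
# ['load', 'my file.txt', '--verbose']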

Fredrik Pihl
  • 44,604
  • 7
  • 83
  • 130
2

If you're evaluating PyParsing, I think you should look at funcparserlib: http://pypi.python.org/pypi/funcparserlib

It's a bit similar, but in my experience the resulting code is much cleaner.

Alexander Solovyov
  • 1,526
  • 1
  • 13
  • 21
0

Antlr generates LL(*) parsers. That can be good, but sometimes removing all left recursion can be cumbersome.

If you are LALR(1)-savvy, you can use PyBison. Its syntax is similar to Yacc's, if you know what that is. Plus, there are a lot of people out there who know how yacc works.

Thaddee Tyl
  • 1,126
  • 1
  • 12
  • 17