Obtain parse tree for python code

Question

I would like to be able to generate a parse tree for python source code. This code does not have to be compilable, e.g.

if x == 5:

should be turned some sort of tree representation. I can use the Python compiler package to create a tree but this only works for code that is compilable, e.g.

if x == 5: print True

What would the syntax tree of that partial code look like? How do you represent a tree that has a hole in it? — Kevin, May 04 '16 at 03:35
You can use [`pyparsing`](https://pypi.python.org/pypi/pyparsing/2.1.1) to write your own parser if nothing else works. — Akshat Mahajan, May 04 '16 at 03:51
I'm trying to replicate this NLP research paper that uses a tree-based structure for machine translation. The structure of the parse tree in shown on page 5. http://www.phontron.com/paper/oda15ase.pdf — user1879926, May 04 '16 at 03:52
`if x == 5:` is incomplete syntax, you cannot do with built-in ast. — YOU, May 08 '16 at 04:14

score 0 · Answer 1 · answered May 08 '16 at 04:10

The paper you linked to says that used the ast module in the Python standard library. It also says they used a dummy body for the body of the if statement. Use a statement that will be easy to recognize as being a dummy body, like pass or a function call like dummy().

score 0 · Answer 2 · edited May 23 '17 at 12:07

Our DMS Software Reengineering Toolkit with its Python front end can do this.

DMS provides infrastructure for parsing code, parameterized by a language definition (e.g, a Python grammar, etc.) and automatically building ASTs, as well as the ability to inspect/navigate/change those ASTs, and prettyprint the resulting modified trees.

Its AST parsing machinery can handle a variety of special cases:

Parsing files or strings ("streams") as a (Python) full program. Syntax errors in a stream are reported, and if repairable by single token insertion or deletion, so repaired.
Parsing a stream according to an arbitrary language nonterminal.
Parsing a pattern, corresponding to a named grammar nonterminal with named placeholders for the missing subtrees. A pattern match result can be used to match against concrete ASTs to decide match or not, and if match, to provide bindings for the pattern variables.
Parsing a valid arbitrary substring. This returns a tree with possible missing left or right children, which define the left and right ends of the substring.

For instance, OP could write the following pattern to handle his example:

pattern if_x_is_5(s: statement):statement
  =  " if x==5: \s ";

DMS will read that pattern and build the corresponding pattern tree.

The paper that OP references really wants operators and keywords to remain as explicit artifacts in the AST. One way to interpret that is that they really want a concrete syntax tree. DMS actually produces "AST"s which are concrete syntax trees with the constant terminals removed; this has the effect of being very close to what a perfect AST should be, but one can easily determine for any leaf node where constant terminals should be inserted (or one can configure DMS to simply produce the uncompressed CSTs).

Personally, I don't see how the goal of the paper of OP's interest can really succeed in providing useful psuedo-code (in spite of its claims). Understanding an algorithm requires understanding of the corresponding data structures and the abstract and concrete algorithms being applied to those data structures. The paper focuses only on raw language syntax; there is no hint of understanding the more abstract ideas.

Obtain parse tree for python code

2 Answers2