Stanford NLP parse tree format

Question

This may be a silly question, but how does one iterate through a parse tree as an output of an NLP parser (like Stanford NLP)? It's all nested brackets, which is neither an array nor a dictionary or any other collection type I've used.

(ROOT\n  (S\n    (PP (IN As)\n      (NP (DT an) (NN accountant)))\n    (NP (PRP I))\n    (VP (VBP want)\n      (S\n        (VP (TO to)\n          (VP (VB make)\n            (NP (DT a) (NN payment))))))))

FWIW this is how nested lists are represented in Lisp. Imagine square brackets instead of round brackets and quotes around the tokens if that helps. — tripleee, Dec 22 '15 at 07:17
@tripleee out of curiosity, is there a native python regex or function to read Lisp like nested list in python? — alvas, Dec 24 '15 at 09:22
Definitely not regex! I could not find a built-in canned parser, but see http://stackoverflow.com/questions/3182594/parsing-s-expressions-in-python and https://sexpdata.readthedocs.org/en/latest/ — tripleee, Dec 24 '15 at 11:56

alvas · Accepted Answer · 2015-12-22T06:35:43.230

This particular output format of the Stanford Parser is call the "bracketed parse (tree)". It is supposed to be read as a graph with

words as nodes (e.g. As, an, accountant)
phrase/clause as labels (e.g. S, NP, VP)
edges are linked hierarchically and
typically the parses TOP or root node is a hallucinated ROOT

(In this case you can read it as a Directed Acyclic Graph (DAG) since it's unidirectional and non-cyclic)

There are libraries out there to read bracketed parse, e.g. in NLTK's nltk.tree.Tree (http://www.nltk.org/howto/tree.html):

>>> from nltk.tree import Tree
>>> output = '(ROOT (S (PP (IN As) (NP (DT an) (NN accountant))) (NP (PRP I)) (VP (VBP want) (S (VP (TO to) (VP (VB make) (NP (DT a) (NN payment))))))))'
>>> parsetree = Tree.fromstring(output)
>>> print parsetree
(ROOT
  (S
    (PP (IN As) (NP (DT an) (NN accountant)))
    (NP (PRP I))
    (VP
      (VBP want)
      (S (VP (TO to) (VP (VB make) (NP (DT a) (NN payment))))))))
>>> parsetree.pretty_print()
                           ROOT                             
                            |                                
                            S                               
      ______________________|________                        
     |                  |            VP                     
     |                  |    ________|____                   
     |                  |   |             S                 
     |                  |   |             |                  
     |                  |   |             VP                
     |                  |   |     ________|___               
     PP                 |   |    |            VP            
  ___|___               |   |    |    ________|___           
 |       NP             NP  |    |   |            NP        
 |    ___|______        |   |    |   |         ___|_____     
 IN  DT         NN     PRP VBP   TO  VB       DT        NN  
 |   |          |       |   |    |   |        |         |    
 As  an     accountant  I  want  to make      a      payment

>>> parsetree.leaves()
['As', 'an', 'accountant', 'I', 'want', 'to', 'make', 'a', 'payment']

score 3 · Answer 2 · answered May 04 '16 at 05:24

3

Note that if you're interested in specific nodes in the tree, identified by regex-like rules, you can use this very, very hand class to extract all such nodes using a regex-like matcher:

http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/tregex/TregexPattern.html

answered May 04 '16 at 05:24

John Stewart

343
2
11

you saved me countless hours of research. Thank you ! – Digao Dec 11 '18 at 01:18

Stanford NLP parse tree format

2 Answers2

Linked