encode the structure of the parse trees

Question

I'm working with the Stanford sentiment classification dataset and I'm trying to understand the two files STree.txt and SOStr.txt, which encode the parse tree of each sentence.

How can I decode, for example, this parse tree?

 Effective|but|too-tepid|biopic

 6|6|5|5|7|7|0

The README file says:

SOStr.txt and STree.txt encode the structure of the parse trees. STree encodes the trees in a parent pointer format. Each line corresponds to each sentence in the datasetSentences.txt file

Is there a parser that converts a sentence into this format? How can I decode this parse tree?

I printed a constituency tree of the previous sentence with this Python script:

 # load_constituency_tree is defined elsewhere (not shown here).
 with open('parents.txt') as parentsfile, open('sents.txt') as toksfile:
     # One list of parent indices and one list of tokens per sentence.
     parents = [list(map(int, line.split())) for line in parentsfile]
     toks = [line.strip().split() for line in toksfile]

 const_trees = []
 for i in range(len(toks)):
     const_trees.append(load_constituency_tree(parents[i], toks[i]))
     tree = const_trees[i]

     # Dump the attributes of the root, its children and its grandchildren.
     nodes = [tree, tree.right, tree.left,
              tree.right.right, tree.right.left,
              tree.left.left, tree.left.right]
     for node in nodes:
         print(', '.join("%s: %s" % item for item in vars(node).items()))

     break  # only inspect the first sentence

and I realized that the tree for the first sentence is the following:

                              6
                              |
                +-------------+------------+
                |                          |
                5                          4
      +---------+---------+      +---------+---------+
      |                   |      |                   |
  Effective              but  too-tepid            biopic
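
The same structure can be rebuilt directly from the two example lines. This is only a minimal sketch of how I read the format, assuming that nodes are numbered from 1, that the tokens are nodes 1..n in sentence order, and that a parent of 0 marks the root. The names `decode_parent_pointers`, `first_token` and `to_brackets` are my own and do not come from the dataset tools. Because the numbering here is 1-based, the node numbers are one higher than in the drawing above.

```python
from collections import defaultdict


def decode_parent_pointers(tokens_line, parents_line, sep="|"):
    """Return (root, children, words) for one sentence.

    Assumption: node k (1-based) has parent parents[k - 1], and 0 means root.
    """
    tokens = tokens_line.strip().split(sep)
    parents = [int(p) for p in parents_line.strip().split(sep)]
    children = defaultdict(list)
    root = None
    for node, parent in enumerate(parents, start=1):
        if parent == 0:
            root = node
        else:
            children[parent].append(node)
    # Assumption: the first len(tokens) nodes are the leaves, in sentence order.
    words = {i: tok for i, tok in enumerate(tokens, start=1)}
    return root, dict(children), words


def first_token(node, children):
    """Smallest token index covered by a node (used to order siblings)."""
    if node not in children:
        return node
    return min(first_token(c, children) for c in children[node])


def to_brackets(node, children, words):
    """Render the subtree under node as nested brackets in sentence order."""
    if node not in children:
        return words[node]
    kids = sorted(children[node], key=lambda c: first_token(c, children))
    return "(" + " ".join(to_brackets(c, children, words) for c in kids) + ")"


root, children, words = decode_parent_pointers(
    "Effective|but|too-tepid|biopic", "6|6|5|5|7|7|0")
print(to_brackets(root, children, words))
# prints: ((Effective but) (too-tepid biopic))
```

In this reading the integers are only node indices that say which node is the parent of which; the sketch does not recover any phrase type from them.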

As described in this post, the non-terminals are types of phrases, but in this representation of the tree they are indices, maybe into a dictionary of phrase types. My question is: where is this dictionary? How can I convert these integers into phrase types?

My solution: I'm not sure that this is the solution, but I wrote this function to convert an nltk PTree to the corresponding parent pointer list:

# Given the list returned by ptree.treepositions('postorder') from the nltk library, i.e.
# a list of tuples like this:
# [(0, 0), (0,), (1, 0, 0), (1, 0), (1, 1, 0), (1, 1, 1), (1, 1), (1,), ()]
# which describes the structure of a tree, where each index of the list is the index of a node in the tree in postorder,
# return a list with the parent of each node, i.e. [2, 9, 4, 8, 7, 7, 8, 9, 0], where 0 marks the root.
# The list above describes the structure of this tree:
#             S
#  ___________|___
# |               VP
# |      _________|___
# NP    V             NP
# |     |          ___|____
# I  enjoyed      my     cookie


def make_parents_list(treepositions):
    parents = []
    for i, pos in enumerate(treepositions):
        if len(pos) == 0:
            # The empty tuple is the root, which has no parent.
            parents.append(0)
            continue
        # In postorder, the parent is the first later node that is one level shallower.
        candidates = [j + 1 for j in range(i + 1, len(treepositions))
                      if len(treepositions[j]) == len(pos) - 1]
        parents.append(candidates[0])
    return parents
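
As a cross-check of make_parents_list, the parent of a tree position is simply that position without its last element, so it can also be looked up directly. This is a minimal sketch; the name `parents_from_positions` is hypothetical, and it only uses the tuple list from the comment above, so nltk is not needed to run it.

```python
def parents_from_positions(treepositions):
    """Return the 1-based index of each node's parent (0 for the root)."""
    # Map every position tuple to its 1-based index in the given order.
    index_of = {tuple(pos): i + 1 for i, pos in enumerate(treepositions)}
    # Dropping the last element of a position gives the parent's position.
    return [index_of[tuple(pos[:-1])] if len(pos) > 0 else 0
            for pos in treepositions]


positions = [(0, 0), (0,), (1, 0, 0), (1, 0), (1, 1, 0),
             (1, 1, 1), (1, 1), (1,), ()]
print(parents_from_positions(positions))
# prints: [2, 9, 4, 8, 7, 7, 8, 9, 0]
```

The lookup does not depend on the positions being in postorder, and it gives the same list as the expected output in the comment above. Note that a postorder numbering interleaves leaves and inner nodes, so the indices will generally differ from a file that numbers all tokens first.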
