
I'm working on the Stanford sentiment classification dataset and I'm trying to understand the two files STree.txt and SOStr.txt, which encode the parse tree of each sentence.

How can I decode, for example, this parse tree?

 Effective|but|too-tepid|biopic

 6|6|5|5|7|7|0

The README file says:

  1. SOStr.txt and STree.txt encode the structure of the parse trees. STree encodes the trees in a parent pointer format. Each line corresponds to each sentence in the datasetSentences.txt file
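
A minimal sketch of one way to read this parent pointer format, assuming that entries are 1-based, that the first entries correspond to the tokens of SOStr.txt in order, and that 0 marks the root. These assumptions are inferred from the example line above, not stated in the README, and the function names are hypothetical:

```python
def decode_parent_pointers(tokens_line, parents_line):
    """Return ({parent_id: [child_ids]}, root_id, {leaf_id: token})."""
    tokens = tokens_line.strip().split("|")
    parents = [int(p) for p in parents_line.strip().split("|")]

    children = {}
    root = None
    # Node ids are assumed to be 1-based positions in the parents list.
    for node_id, parent in enumerate(parents, start=1):
        if parent == 0:
            root = node_id  # assumed: 0 means "no parent"
        else:
            children.setdefault(parent, []).append(node_id)

    # Assumed: the first len(tokens) node ids are the leaves, in sentence order.
    leaves = {i: tok for i, tok in enumerate(tokens, start=1)}
    return children, root, leaves


def to_brackets(node, children, leaves):
    """Return (first_leaf_id, bracketed_string) for the subtree rooted at node."""
    if node in leaves:
        return node, leaves[node]
    # Sort the children by their first leaf so the words stay in sentence order.
    parts = sorted(to_brackets(c, children, leaves) for c in children[node])
    return parts[0][0], "(" + " ".join(text for _, text in parts) + ")"


children, root, leaves = decode_parent_pointers(
    "Effective|but|too-tepid|biopic", "6|6|5|5|7|7|0")
print(to_brackets(root, children, leaves)[1])
# prints: ((Effective but) (too-tepid biopic))
```

Under these assumptions the numbers are node indices only: node 6 groups the first two tokens, node 5 groups the last two, and node 7 is the root.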

Is there a parser that converts a sentence into this format? How can I decode this parse tree?

I printed a constituency tree of the previous sentence with this Python script:

 # NOTE: load_constituency_tree is not defined in this snippet; it has to be
 # imported or defined elsewhere before this code runs.
 with open('parents.txt') as parentsfile, open('sents.txt') as toksfile:
     # One list of parent indices and one list of tokens per sentence.
     parents = [list(map(int, line.split())) for line in parentsfile]
     toks = [line.strip().split() for line in toksfile]

 const_trees = []
 for i in range(len(toks)):
     const_trees.append(load_constituency_tree(parents[i], toks[i]))
     tree = const_trees[i]

     # Dump the attributes of the root, its children and its grandchildren.
     for node in (tree, tree.right, tree.left,
                  tree.right.right, tree.right.left,
                  tree.left.left, tree.left.right):
         print(', '.join("%s: %s" % item for item in vars(node).items()))

     break  # only inspect the first sentence

From the output I worked out that the tree for the first sentence is the following:

                              6
                              |
                +-------------+------------+
                |                          |
                5                          4
      +---------+---------+      +---------+---------+
      |                   |      |                   |
  Effective              but  too-tepid            biopic

As described in this post, the non-terminals are phrase types, but in this representation of the tree they are indices, perhaps into a dictionary of phrase types. My question is: where is this dictionary? How can I convert these integers into phrase types?

My solution: I'm not sure that this is correct, but I wrote this function to convert an nltk PTree into the corresponding parent pointer list:

# Given the list returned by ptree.treepositions('postorder') from the nltk library, i.e.
# a list of tuples like this:
# [(0, 0), (0,), (1, 0, 0), (1, 0), (1, 1, 0), (1, 1, 1), (1, 1), (1,), ()]
# which describes the structure of a tree, where each list index is the index of a node in postorder,
# return the list of parents for each node, e.g. [2, 9, 4, 8, 7, 7, 8, 9, 0], where 0 marks the root.
# The list above describes the structure of this tree:
#             S
#  ___________|___
# |               VP
# |      _________|___
# NP    V             NP
# |     |          ___|____
# I  enjoyed      my     cookie


def make_parents_list(treepositions):
    parents = []
    for i, pos in enumerate(treepositions):
        if len(pos) == 0:
            # The empty tuple is the root, which has no parent.
            parents.append(0)
        else:
            # In postorder, the first later node that is one level shallower is the parent.
            parent = next(j + 1 for j in range(i + 1, len(treepositions))
                          if len(treepositions[j]) == len(pos) - 1)
            parents.append(parent)
    return parents
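
A possible alternative, sketched under the same assumptions as the function above (1-based indices in the order the positions are given, 0 for the root): since the parent of a tree position is that position with its last element removed, the parent index can be looked up directly instead of scanning forward. The name `parents_from_positions` is hypothetical, and the position list is the one from the comment above rather than the output of a real nltk call:

```python
def parents_from_positions(treepositions):
    """Return the 1-based parent index of every position; 0 marks the root."""
    # Map each tree position to its 1-based index in the given order.
    index_of = {pos: i + 1 for i, pos in enumerate(treepositions)}
    # The parent of position p is p[:-1]; the empty tuple is the root.
    return [index_of[pos[:-1]] if pos else 0 for pos in treepositions]


# Positions copied from the comment above (postorder of the "I enjoyed my cookie" tree).
positions = [(0, 0), (0,), (1, 0, 0), (1, 0), (1, 1, 0), (1, 1, 1), (1, 1), (1,), ()]
print(parents_from_positions(positions))
# prints: [2, 9, 4, 8, 7, 7, 8, 9, 0]
```

Note that in the STree.txt example earlier in this question the four tokens occupy indices 1 to 4, so a postorder numbering like this one would not reproduce that layout directly.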
Alberto Merciai