
I'm working with the Stanford Sentiment Treebank dataset and I'm attempting to extract the leaves and the nodes. The data is given as follows:

(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))

What I would like is something like the following:

i) The leaves with the label (uni-gram):

[(2 The), (2 Rock), (2 is), (2 destined),...]

ii) The upper nodes with the labels (bi-gram):

[(2 (2 The) (2 Rock)), (2 (2 ``) (2 Conan)), (2 (2 Century) (2 's)),...]

until I get to the root of the tree.

I've attempted to use regex to accomplish this but it fails to output correctly.

The code I have (for the uni-gram):

import re
import nltk

location = '.../NLP/Standford_Sentiment_Tree_Data_Set/' +\
           'trainDevTestTrees_PTB/trees/train.txt'
with open(location, 'r') as text:
    test = text.readlines()[0]

uni_regex = re.compile(r'(\([0-4] \w+\))')
temp01 = uni_regex.findall(test)

# bi-gram
bi_regex = re.compile(r'(\([0-4] \([0-4] \w+\) \([0-4] \w+\)\))')
temp02 = bi_regex.findall(test)

The above code outputs:

['(2 The)', '(2 Rock)', '(2 is)', '(2 destined)', '(2 to)', '(2 be)', '(2 the)', '(2 21st)', '(2 Century)', '(3 new)',...]

It fails to capture (2 ``) and (2 ''), and it extracts (2 Jean) instead of (2 Jean-Claud).

The bi-gram output likewise fails to capture (2 (2 ``) (2 Conan)).

Is there a way to get the result that I want using nltk or some configuration of regex that will not miss any tokens?

I've had a look at, and attempted to modify, the solution provided in "NLTK tree data structure, finding a node, it's parent or children", but that question seems to deal with finding a specific word in a leaf and then displaying the tree structure, whereas I need the output to resemble the n-grams above.

Lukasz

1 Answer


Don't waste your time with regexps; this is what tree classes are for. Use nltk's Tree class like this:

mytree = "(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))"

>>> t = nltk.Tree.fromstring(mytree)
>>> print(t)
(3
  (2 (2 The) (2 Rock))
  (4
    (3
      (2 is)
      (4
        (2 destined)
        (2
          ...

You can then extract and count the leaves, and request the corresponding "treepositions" (the path to each leaf, in the form of a tuple of child indices):

>>> leafpos = [ t.leaf_treeposition(n) for n, x in enumerate(t.leaves()) ]
>>> print(leafpos[0:3])
[(0, 0, 0), (0, 1, 0), (1, 0, 0, 0)]
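
As a small sketch of how these positions can be used, the snippet below pairs each leaf with the label of the node directly above it, which is roughly the "uni-gram" list asked for. It runs on a short fragment taken from the sentence above (the variable names `fragment` and `labelled_leaves` are only illustrative), and it assumes that `treepositions('leaves')` is an acceptable shortcut for the comprehension shown above.

```python
from nltk import Tree

# A real fragment of the sentence above, kept short for readability.
fragment = "(2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan))))"
t = Tree.fromstring(fragment)

# Same positions as the leaf_treeposition() comprehension above.
leafpos = t.treepositions('leaves')

# t[path] is the leaf string; t[path[:-1]] is the node directly above it.
labelled_leaves = [(t[path[:-1]].label(), t[path]) for path in leafpos]
print(labelled_leaves)
```

Note that the labels come back as strings (`'2'`, `'3'`), not integers, and that tokens such as `` and 's are kept intact because the tree reader splits on whitespace and parentheses rather than on word characters.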

Finally, you can walk up the treepositions to get the units you want: the subtree dominated by the node immediately above each leaf, two steps above each leaf, etc:

>>> level1_subtrees = [ t[path[:-1]] for path in leafpos ]
>>> for x in level1_subtrees:
...     print(x, end = " ")
(2 The) (2 Rock) (2 is) (2 destined) (2 to) (2 be) (2 the) ...

>>> level2_subtrees = [ t[path[:-2]] for path in leafpos ]
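
The list comprehensions above return one subtree per leaf, so a node that sits above several leaves shows up several times. A possible way to get each node only once is sketched below; the helper name `subtrees_above_leaves` is hypothetical, and the tree is again a short fragment of the sentence above rather than the full tree.

```python
from nltk import Tree

# A real fragment of the sentence above, kept short for readability.
fragment = "(2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan))))"
t = Tree.fromstring(fragment)

def subtrees_above_leaves(tree, k):
    """Distinct subtrees k levels (k >= 1) above each leaf, in sentence order."""
    seen = set()
    result = []
    for path in tree.treepositions('leaves'):
        pos = path[:-k]          # drop the last k steps of the path
        if pos not in seen:      # several leaves can share one ancestor
            seen.add(pos)
            result.append(tree[pos])
    return result

level1 = subtrees_above_leaves(t, 1)
level2 = subtrees_above_leaves(t, 2)

for sub in level2:
    print(sub.label(), sub.leaves())
```

On this fragment, `level2` contains three subtrees, and one of them spans three words ("new", "``", "Conan"), because the leaves do not all sit at the same depth.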

Note, however, that higher-level subtrees may not look the way you imagine. If you go up two levels from leaf 3 (destined), for example, you won't get a "bigram". You'll be at the node labeled 4, which dominates most of the rest of the sentence. Perhaps you're actually interested in enumerating all subtrees? In that case, just iterate over t.subtrees().
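
If the goal really is "all labelled units containing one word, two words, and so on", one option is to group the results of `t.subtrees()` by how many leaves each subtree covers. This is only a sketch under that assumption; the name `by_size` is illustrative, and the input is a short fragment of the sentence above.

```python
from collections import defaultdict
from nltk import Tree

# A real fragment of the sentence above, kept short for readability.
fragment = "(2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan))))"
t = Tree.fromstring(fragment)

# Group every subtree (the root included) by the number of words it covers.
by_size = defaultdict(list)
for sub in t.subtrees():
    by_size[len(sub.leaves())].append(sub)

for size in sorted(by_size):
    # Collapse nltk's pretty-printing onto one line for display.
    print(size, [' '.join(str(sub).split()) for sub in by_size[size]])
```

Here `by_size[1]` holds the labelled leaves and `by_size[2]` holds the two-word nodes. Grouping by leaf count rather than by distance from the leaves avoids the problem described above, since it does not depend on how deep each word sits in the tree.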

If that's not what you want, take a look at the Tree API and pick out another way to select the parts you need.

alexis