I'm working with the Stanford Sentiment Treebank dataset and I'm attempting to extract the leaves and the nodes. The data is given as follows:
(3 (2 (2 The) (2 Rock)) (4 (3 (2 is) (4 (2 destined) (2 (2 (2 (2 (2 to) (2 (2 be) (2 (2 the) (2 (2 21st) (2 (2 (2 Century) (2 's)) (2 (3 new) (2 (2 ``) (2 Conan)))))))) (2 '')) (2 and)) (3 (2 that) (3 (2 he) (3 (2 's) (3 (2 going) (3 (2 to) (4 (3 (2 make) (3 (3 (2 a) (3 splash)) (2 (2 even) (3 greater)))) (2 (2 than) (2 (2 (2 (2 (1 (2 Arnold) (2 Schwarzenegger)) (2 ,)) (2 (2 Jean-Claud) (2 (2 Van) (2 Damme)))) (2 or)) (2 (2 Steven) (2 Segal))))))))))))) (2 .)))
What I would like is something as follows:
i) The leaves with the label (uni-gram):
[(2 The), (2 Rock), (2 is), (2 destined),...]
ii) The upper nodes with the labels (bi-gram):
[(2 (2 The) (2 Rock)), (2 (2 ``) (2 Conan)), (2 (2 Century) (2 's)),...]
until I get to the root of the tree.
I've attempted to use regex to accomplish this but it fails to output correctly.
The code I have (for the uni-grams and bi-grams):
import re
import nltk
location = '.../NLP/Standford_Sentiment_Tree_Data_Set/' +\
'trainDevTestTrees_PTB/trees/train.txt'
# read only the first tree (the file holds one tree per line)
with open(location, 'r') as text:
    test = text.readlines()[0]
uni_regex = re.compile(r'(\([0-4] \w+\))')
temp01 = uni_regex.findall(test)
# bi-gram
bi_regex = re.compile(r'(\([0-4] \([0-4] \w+\) \([0-4] \w+\)\))')
temp02 = bi_regex.findall(test)
For the uni-grams, the above code outputs:
['(2 The)', '(2 Rock)', '(2 is)', '(2 destined)', '(2 to)', '(2 be)', '(2 the)', '(2 21st)', '(2 Century)', '(3 new)',...]
It fails to capture (2 ``) and (2 ''), and extracts (2 Jean) instead of (2 Jean-Claud).
The bi-gram output likewise fails to capture (2 (2 ``) (2 Conan)).
Is there a way to get the result that I want using nltk or some configuration of regex that will not miss any tokens?
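On the regex side, the misses appear to come from \w+, which only matches letters, digits and underscores. Below is a minimal sketch, assuming every token is a run of characters containing no whitespace or parentheses; the short sample string and the variable names are made up for illustration. Note that a fixed pattern like this only covers one nesting depth, so it cannot walk all the way up to the root.

```python
import re

# Short made-up sample in the same bracketed format as the treebank line.
sample = "(2 (2 (3 new) (2 (2 ``) (2 Conan))) (2 (2 Jean-Claud) (2 '')))"

# \w+ matches only letters, digits and underscore, so tokens such as
# `` or Jean-Claud do not match as a whole.
word_only = re.compile(r"\([0-4] \w+\)")

# [^\s()]+ matches any run of characters that is neither whitespace
# nor a parenthesis, which covers quotes, hyphens and punctuation.
any_token = re.compile(r"\([0-4] [^\s()]+\)")

# Same idea one level up: a labelled node holding exactly two leaves.
bi_token = re.compile(r"\([0-4] \([0-4] [^\s()]+\) \([0-4] [^\s()]+\)\)")

word_matches = word_only.findall(sample)
token_matches = any_token.findall(sample)
bigram_matches = bi_token.findall(sample)

print(word_matches)
print(token_matches)
print(bigram_matches)
```

With the wider character class, the quote tokens and the hyphenated name are picked up at both the leaf level and the two-leaf level.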
I've had a look at, and attempted to modify, the solution provided in "NLTK tree data structure, finding a node, it's parent or children", but that question seems to deal with finding a specific word in a leaf and then displaying the tree structure, whereas I require the intended solution to resemble the above n-grams.
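To make the target output concrete, here is a rough sketch of the grouping described above, written with the standard library only. It assumes each line is a well-formed bracketed tree whose tokens contain no literal parentheses; the function name group_subtrees_by_size and the short sample are hypothetical. nltk's Tree.fromstring together with Tree.subtrees() should allow the same grouping, but this version avoids the dependency.

```python
from collections import defaultdict


def group_subtrees_by_size(line):
    """Map n -> list of bracketed subtrees that span exactly n leaves."""
    # Pad parentheses with spaces so a plain split() tokenizes the line.
    tokens = line.replace("(", " ( ").replace(")", " ) ").split()
    groups = defaultdict(list)
    stack = []  # one [text_parts, leaf_count] entry per currently open node
    for tok in tokens:
        if tok == "(":
            stack.append([[], 0])
        elif tok == ")":
            parts, n_leaves = stack.pop()
            text = "(" + " ".join(parts) + ")"
            groups[n_leaves].append(text)
            if stack:  # hand the finished node up to its parent
                stack[-1][0].append(text)
                stack[-1][1] += n_leaves
        elif not stack[-1][0]:
            stack[-1][0].append(tok)  # first token of a node is its label
        else:
            stack[-1][0].append(tok)  # any later bare token is a leaf word
            stack[-1][1] += 1
    return groups


# Made-up sample in the same format as the treebank line.
sample = "(3 (2 (2 The) (2 Rock)) (3 (2 's) (2 (2 ``) (2 Jean-Claud))))"
groups = group_subtrees_by_size(sample)
unigrams = groups[1]
bigrams = groups[2]
print(unigrams)
print(bigrams)
```

Here groups[1] holds the labelled leaves, groups[2] the nodes spanning two leaves, and so on, with the largest key holding the whole tree.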