
Given an NLP parse tree like

(ROOT (S (NP (PRP You)) (VP (MD could) (VP (VB say) (SBAR (IN that) (S (NP (PRP they)) (ADVP (RB regularly)) (VP (VB catch) (NP (NP (DT a) (NN shower)) (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ adds) (PP (TO to) (NP (NP (PRP$ their) (NN exhilaration)) (CC and) (NP (FW joie) (FW de) (FW vivre))))))))))))) (. .)))

The original sentence is "You could say that they regularly catch a shower, which adds to their exhilaration and joie de vivre."

How could the clauses be extracted and turned back into plain text? We would be splitting at S and SBAR (to preserve the type of clause, e.g. subordinate):

 - (S (NP (PRP You)) (VP (MD could) (VP (VB say) 
 - (SBAR (IN that) (S (NP (PRP they)) (ADVP (RB regularly)) (VP (VB catch) (NP (NP (DT a) (NN shower))
 - (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ adds) (PP (TO to)
   (NP (NP (PRP$ their) (NN exhilaration)) (CC and) (NP (FW joie) (FW
   de) (FW vivre))))))))))))) (. .)))

to arrive at

 - You could say
 - that they regularly catch a shower 
 - , which adds to their exhilaration and joie de vivre.

Splitting at S and SBAR seems very easy. The problem seems to be stripping away all the POS tags and chunks from the fragments.

giorgio79

2 Answers


You can use Tree.subtrees(). For more information, see the NLTK Tree class documentation.

Code:

from nltk import Tree

parse_str = "(ROOT (S (NP (PRP You)) (VP (MD could) (VP (VB say) (SBAR (IN that) (S (NP (PRP they)) (ADVP (RB regularly)) (VP (VB catch) (NP (NP (DT a) (NN shower)) (, ,) (SBAR (WHNP (WDT which)) (S (VP (VBZ adds) (PP (TO to) (NP (NP (PRP$ their) (NN exhilaration)) (CC and) (NP (FW joie) (FW de) (FW vivre))))))))))))) (. .)))"
#parse_str = "(ROOT (S (SBAR (IN Though) (S (NP (PRP he)) (VP (VBD was) (ADJP (RB very) (JJ rich))))) (, ,) (NP (PRP he)) (VP (VBD was) (ADVP (RB still)) (ADJP (RB very) (JJ unhappy))) (. .)))"

t = Tree.fromstring(parse_str)
# print(t)

# Collect the full text of every S and SBAR subtree, in preorder
subtexts = []
for subtree in t.subtrees():
    if subtree.label() in ("S", "SBAR"):
        subtexts.append(' '.join(subtree.leaves()))
# print(subtexts)

presubtexts = subtexts[:]       # ADDED IN EDIT for leftover check

# Truncate each clause at the point where the next clause in the list begins
for i in reversed(range(len(subtexts)-1)):
    subtexts[i] = subtexts[i][0:subtexts[i].index(subtexts[i+1])]

for text in subtexts:
    print(text)

# ADDED IN EDIT - Not sure for generalized cases
leftover = presubtexts[0][presubtexts[0].index(presubtexts[1])+len(presubtexts[1]):]
print(leftover)

Output:

You could say 
that 
they regularly catch a shower , 
which 
adds to their exhilaration and joie de vivre
 .
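Since the substring search above relies on each clause being nested at the end of the previous one, here is a rough alternative sketch that avoids string matching: every word is assigned to its nearest enclosing S or SBAR node. It uses only the standard library instead of NLTK so it can run anywhere; `parse_brackets` and `clause_fragments` are hypothetical helper names, not part of any library, and the parse string is the one reported as failing in the comments.

```python
import re

CLAUSE_LABELS = {"S", "SBAR"}

def parse_brackets(text):
    """Parse a bracketed parse string into nested [label, child, ...] lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    stack = [[]]                       # stack[0] collects the top-level node
    for tok in tokens:
        if tok == "(":
            stack.append([])           # open a new node
        elif tok == ")":
            node = stack.pop()         # close the node and attach it to its parent
            stack[-1].append(node)
        else:
            stack[-1].append(tok)      # a label or a word
    return stack[0][0]

def clause_fragments(node, out=None, current=None):
    """Attach each word to its nearest enclosing S/SBAR node, in preorder."""
    if out is None:
        out = []
    label, children = node[0], node[1:]
    if label in CLAUSE_LABELS:
        current = []                   # words directly owned by this clause
        out.append((label, current))
    for child in children:
        if isinstance(child, list):
            clause_fragments(child, out, current)
        elif current is not None:
            current.append(child)      # a word; words outside any clause are skipped
    return out

parse_str = "(ROOT (S (SBAR (IN Though) (S (NP (PRP he)) (VP (VBD was) (ADJP (RB very) (JJ rich))))) (, ,) (NP (PRP he)) (VP (VBD was) (ADVP (RB still)) (ADJP (RB very) (JJ unhappy))) (. .)))"

tree = parse_brackets(parse_str)
fragments = [(label, " ".join(words)) for label, words in clause_fragments(tree)]
for label, text in fragments:
    print(label, "->", text)
```

One design consequence to be aware of: a clause's own words are joined even when a nested clause interrupts them, so on the first example the final period would end up attached to "You could say" rather than printed as a separate leftover.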
RAVI
  • Wow! Amazing! @RAVI you are quite the NLP Guru! Where can I reach you? :) – giorgio79 Sep 04 '16 at 18:52
  • I noticed this algo fails on some parses like this `(ROOT (S (SBAR (IN Though) (S (NP (PRP he)) (VP (VBD was) (ADJP (RB very) (JJ rich))))) (, ,) (NP (PRP he)) (VP (VBD was) (ADVP (RB still)) (ADJP (RB very) (JJ unhappy))) (. .)))` – giorgio79 Sep 13 '16 at 14:30
  • Updated Answer. – RAVI Sep 13 '16 at 15:26
  • @RAVI Not the right place to ask this question but would you know how to extract clauses like above using the Stanford Parser in java? – serendipity Sep 15 '17 at 10:48
  • How can I get parse_str like in your example if I only have a sentence string? I want to do exactly the same thing, but I only have the raw sentences. – xzegga Mar 19 '18 at 21:52

First, get the parse tree:

# stanza.install_corenlp()

from stanza.server import CoreNLPClient

text = "Joe realized that the train was late while he waited at the train station"

with CoreNLPClient(
        annotators=['tokenize', 'pos', 'lemma', 'parse', 'depparse'],
        output_format="json",
        timeout=30000,
        memory='16G') as client:
    output = client.annotate(text)
    # print(output.sentence[0])
    parse_tree = output['sentences'][0]['parse']
    parse_tree = ' '.join(parse_tree.split())
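The last line above matters because the parse string typically arrives pretty-printed over several indented lines. As a small illustration of what the split/join idiom does, here is a sketch on a hand-written multi-line string; `raw_parse` is a made-up sample for this example, not actual server output.

```python
# Hand-written sample in the indented, multi-line layout (an assumption for illustration)
raw_parse = """(ROOT
  (S
    (NP (NNP Joe))
    (VP (VBD realized))))"""

# str.split() with no arguments splits on any run of whitespace, including newlines,
# so joining with a single space collapses the tree onto one line
flat_parse = " ".join(raw_parse.split())
print(flat_parse)
```

The flattened string has the same bracket structure, so it can be passed on to whatever consumes a single-line parse.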

Then use this gist to extract clauses by calling:

print_clauses(parse_str=parse_tree)

The output will be:

{'the train was late', 'he waited at the train station', 'Joe realized'}
Amir