How to identify and remove trace trees from nltk.trees?

Question

For example I got this small tree (which is obviously a subtree only):

(VP (VBZ says) (SBAR (-NONE- *0*) (S-3 (-NONE- *T*))))

Trace trees are those trees leading to a leaf of the shape: *.*. I now want to remove all subtrees which are a trace tree. So for this example the result should look like this:

(VP (VBZ says))

So far I extracted all those leaves:

from nltk.tree import ParentedTree
import re

traceLeaves = []    

line = "( (VP (VBZ says) (SBAR (-NONE- *0*) (S-3 (-NONE- *T*)))))"
currTree = ParentedTree.fromstring(line, remove_empty_top_bracketing = True)
for leaf in currTree.leaves():
    if re.search('\*', leaf):
        traceLeaves.append(leaf)

but I got no idea how to navigate up the tree until there exists a sibling which is no trace tree and remove the trace tree from the original tree. I'm completely stuck here since I only started working with nltk...

EDIT:

Here is one complete sentence I want to be able to process:

( (SINV (S-3 (S (NP-SBJ-1 (-NONE- *PRO*)) (VP (VBG Assuming) (NP (DT that) (NN post)) (PP-TMP (IN at) (NP (NP (DT the) (NN age)) (PP (IN of) (NP (CD 35))))))) (, ,) (NP-SBJ-1 (PRP he)) (VP (VBD managed) (PP-MNR (IN by) (NP (NN consensus))) (, ,) (SBAR-ADV (IN as) (S (NP-SBJ (-NONE- *PRO*)) (VP (VBZ is) (NP-PRD (DT the) (NN rule)) (PP-LOC (IN in) (NP (NNS universities)))))))) (, ,) (VP (VBZ says) (SBAR (-NONE- *0*) (S-3 (-NONE- *T*)))) (NP-SBJ (NP (NNP Warren) (NNP H.) (NNP Strother)) (, ,) (NP (NP (DT a) (NN university) (NN official)) (SBAR (WHNP-2 (WP who)) (S (NP-SBJ-2 (-NONE- *T*)) (VP (VBZ is) (VP (VBG researching) (NP (NP (DT a) (NN book)) (PP (IN on) (NP (NNP Mr.) (NNP Hahn)))))))))) (. .)) )

Is there a reason why you using`nltk.tree.ParentTree` instead of `nltk.Tree`? — alvas, Nov 27 '15 at 00:36
Since I need to check if specific trees have siblings aso. I thought it would be a better idea using ParentedTrees — LuZ, Nov 27 '15 at 08:41
Maybe have a look at http://stackoverflow.com/questions/25815002/nltk-tree-data-structure-finding-a-node-its-parent-or-children — b3000, Nov 27 '15 at 10:49

alexis · Accepted Answer · 2015-11-29T21:26:17.470

2

Leaves are regular strings, so they're no help for navigating the tree. Scan the tree and look for subtrees of height 2 instead.

To recognize what should be deleted, note that an nltk tree is a kind of list; so to see how many children a node has, just take its len(). When you find a trace leaf, move up the tree as long as the parent only has one child; then delete the subtree. This should cover everything since if a node dominates two trace trees and nothing else, it will contain only one after you delete the first :-)

The following has one more trick: Deleting a node confuses the for-loop, since the list of branches gets shorter. To prevent things from moving before we delete them, we scan the tree backwards.

for sub in reversed(list(t.subtrees())):
    if sub.height() == 2 and sub[0].startswith("*"):  # abbreviated test
        parent = sub.parent()
        while parent and len(parent) == 1:
            sub = parent
            parent = sub.parent()
        print(sub, "will be deleted")
        del t[sub.treeposition()]

edited Nov 29 '15 at 21:26

answered Nov 29 '15 at 00:24

alexis

48,685
16
101
161

Since I also came up with the 'tree.height() == 2'-idea this is exactly the point where I'm stuck atm... I was thinking about even using MultiparentedTrees to be able to deal with multiple siblings but this makes the parent treatment more complicated...!? – LuZ Nov 29 '15 at 14:25
One of my biggest problems is that I sometimes (in higher levels of the tree) need to compare siblings which haven't been processed so far. Thats why i thought it would be a nice idea to to iterate the tree by height of the subtrees in reverse... So far I also wanted to store all subtrees in a dict with 'isTrace' feature leading to several key Errors when comparing unseen trees... I'm really close to give up since this is driving me crazy..... :( Find my current progress as edit in the question.. – LuZ Nov 29 '15 at 14:36
Did you try my code? Does it fail to do something you want? What is that? (Give an example sentence.) As far as I can see it does exactly what you asked for. – alexis Nov 29 '15 at 20:30
Oh sorry... it seems my first comment disappeared due to copy paste error! ;) When I ran it on the full example i posted as an edit it ends up in an infinite loop. – LuZ Nov 29 '15 at 21:40
I noticed there was a bug (defined `parent` but forgot to update it in the loop). Try again with the corrected code (then check the alternative solution too :-)) – alexis Nov 29 '15 at 21:43
Seems to work well. Thanks a lot! What I do not understand so far is in which order the tree is iterated but I'll figure this out. I'm also still wondering how it's managed to process multiple siblings beeing eihter a trace tree or not without storing seen traces or having a look-ahead!? – LuZ Dec 01 '15 at 14:02
@LuZ, have it print out each node it visits and you'll (hopefully) understand. Multiple siblings are handled because by the time you get to the second one, the first one is gone-- so it looks like the common parent has only one child, and can now be deleted. – alexis Dec 01 '15 at 14:13
Omg... shame on me! It is so simple. I now realise that my approach was far too inconvenient. Thanks a lot again for your help! – LuZ Dec 03 '15 at 11:50
Yup, incremental editing is the key. Try to do it without touching the tree and it's a lot less simple. – alexis Dec 03 '15 at 12:16

score 0 · Answer 2 · edited May 23 '17 at 12:32

Here's a second solution, using Tree.leaf_treeposition() which maps a leaf index to a tree path (leaf scan based on this answer, suggested by @b3000). Again the tree is scanned backwards to allow editing in-place. Take your pick.

Finding the parent uses the fact that a "tree position" is a tuple of the branches to follow; so that if postn is the "position" of node n, postn[:-1] is the position of n's parent. This way, you don't even need a ParentedTree.

t = nltk.Tree.fromstring("(VP (VBZ says) (SBAR (-NONE- *0*) (S-3 (-NONE- *T*))))")

for ind, leaf in reversed(list(enumerate(t.leaves()))):
    if leaf.startswith("*") and leaf.endswith("*"):
        postn = t.leaf_treeposition(ind)
        parentpos = postn[:-1]
        while parentpos and len(t[parentpos]) == 1:
            postn = parentpos
            parentpos = postn[:-1]
        print(t[postn], "will be deleted")
        del t[postn]

How to identify and remove trace trees from nltk.trees?

2 Answers2

Linked