How to get the nodes of a nltk tree without their grammatical form?

Question

I managed to write a class that builds a tree from spaCy output, and I would like the nodes to keep only the words rather than the whole token with its grammatical tags. That is to say, I want start instead of start_VB_ROOT.

For example, for the sentence "When did Beyonce start becoming popular?" the input is:

[Tree('start_VB_ROOT', ['When_WRB_advmod', 'did_VBD_aux', 'Beyonce_NNP_nsubj', Tree('becoming_VBG_xcomp', ['popular_JJ_acomp']), '?_._punct'])]

The expected output, using the class provided below, would be the following tree:

<class 'str'> When_WRB_advmod
son creation : When
<class 'str'> did_VBD_aux
son creation : did
<class 'str'> Beyonce_NNP_nsubj
son creation : Beyonce
<class 'nltk.tree.Tree'> (becoming_VBG_xcomp popular_JJ_acomp)
sub tree creation
son: becoming_VBG_xcomp
<class 'str'> popular_JJ_acomp
son creation popular
end of sub tree creation
<class 'str'> ?_._punct
son creation ?

Here is the class:

from nltk import Tree

class WordTree:
    '''Tree for spaCy dependency parsing array'''
    def __init__(self, array, parent = None):
        """
        Construct a new 'WordTree' object.

        :param array: The array containing the dependency
        :param parent: The parent of the array if exists
        :return: returns nothing
        """
        self.parent = parent  # keep the parent that was passed in
        self.children = []
        self.data = array

        for element in array[0]:
            print(type(element),element)
            # we check if we got a subtree
            if type(element) is Tree:
                print("sub tree creation")
                self.children.append(element.label())
                print("son:",element.label())
                t = WordTree([element],element.label()) # should I verify if parent is empty ?
                print("end of sub tree creation")
            # else if we have a string we create a son
            elif type(element) is str:
                print("son creation",element)
                self.children.append(element)
            # in other case we have a problem
            else:
                print("issue?")
                break

It currently gives the following output:

<class 'str'> When_WRB_advmod
son creation When_WRB_advmod
<class 'str'> did_VBD_aux
son creation did_VBD_aux
<class 'str'> Beyonce_NNP_nsubj
son creation Beyonce_NNP_nsubj
<class 'nltk.tree.Tree'> (becoming_VBG_xcomp popular_JJ_acomp)
sub tree creation
son: becoming_VBG_xcomp
<class 'str'> popular_JJ_acomp
son creation popular_JJ_acomp
end of sub tree creation
<class 'str'> ?_._punct
son creation ?_._punct

@alvas Initially, the desired output is to get the word from its dependency parsing form, for instance `start` as a string from `start_VB_ROOT`. In a second step, I want to implement it in the algorithm. I used the term _python tree_ to distinguish it from the nltk.Tree — Revolucion for Monica, Aug 30 '18 at 02:14
Post the desired output in the question. It's still unclear what you're trying to achieve from your comment =) Also, post the input to get the WordTree object you've posted on the question. — alvas, Aug 30 '18 at 02:16
If the object comes from SpaCy, then there's not much point converting it into NLTK tree just to get the string (aka surface forms / leaves) — alvas, Aug 30 '18 at 02:17
What is the purpose of the desired output? Just for humans to read? If it's something that you need to process again, then it's not useful to produce that output. Also, what's the actual SpaCy object? It's not useful to take the output of one tool (SpaCy), parse it into another output using another tool (NLTK), and then use that output to do something else again. — alvas, Aug 30 '18 at 02:47
The reason I'm asking is that there's a way to get the desired output, and if it's just used for human reading that's fine. Otherwise, it's senseless, and answering the question would lead future readers of the question in the wrong direction. — alvas, Aug 30 '18 at 02:48
@alvas Thank you for your answer, comments and insights. The purpose of the output is to be able to compare it to other trees. For instance if I have a sentence P_i and another one Q I want to know if one is in another. — Revolucion for Monica, Aug 30 '18 at 10:14
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/179117/discussion-between-thepassenger-and-alvas). — Revolucion for Monica, Aug 30 '18 at 15:47

alvas · Accepted Answer · 2018-08-30T03:42:30.193

First, note that the SpaCy "grammatical forms" from the question are actually the surface token with the POS tag and dependency tag appended. In that case, you can simply retrieve the strings with Tree.leaves() and Tree.label() in nltk and strip the tags from them.
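
As a minimal sketch of that idea (the helper name `surface_form` is introduced here for illustration and is not part of nltk), the root word can be read from Tree.label() and the terminal words from Tree.leaves(). Note that leaves() only returns terminal strings, so the labels of nested subtrees such as becoming_VBG_xcomp have to be collected separately, here with Tree.subtrees(). The sketch assumes every node has the form word_POS_dep and that the word itself contains no underscore.

```python
from nltk import Tree

parse = Tree('start_VB_ROOT',
             ['When_WRB_advmod', 'did_VBD_aux', 'Beyonce_NNP_nsubj',
              Tree('becoming_VBG_xcomp', ['popular_JJ_acomp']),
              '?_._punct'])

def surface_form(node_string):
    # Hypothetical helper: keep only the text before the first underscore.
    return node_string.split('_')[0]

# The root label gives the head word of the sentence.
root_word = surface_form(parse.label())

# leaves() returns the terminal strings only (no subtree labels).
leaf_words = [surface_form(leaf) for leaf in parse.leaves()]

# subtrees() yields the tree itself and every nested Tree, so this
# collects the words that appear as labels.
label_words = [surface_form(t.label()) for t in parse.subtrees()]

print(root_word)    # start
print(leaf_words)   # ['When', 'did', 'Beyonce', 'popular', '?']
print(label_words)  # ['start', 'becoming']
```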

But it'll be easier to manipulate the original output of the SpaCy parser rather than messing around with the data format as in the question.

See "How to Traverse an NLTK Tree object?" before continuing, and think recursion (without classes) when doing depth-first traversal.

For future readers: please read the comments on the question before continuing to the answer below.

If you really want to simply remove the POS and dependency tags from the leaves and labels, try this:

from nltk import Tree

parse = Tree('start_VB_ROOT',
             ['When_WRB_advmod', 'did_VBD_aux', 'Beyonce_NNP_nsubj',
              Tree('becoming_VBG_xcomp',
                   ['popular_JJ_acomp']),
              '?_._punct']
             )

def traverse_tree(tree, is_subtree=False):
    for subtree in tree:
        print(type(subtree), subtree)
        if type(subtree) == Tree:
            # Iterate through the depth of the subtree.
            print('sub tree creation')
            traverse_tree(subtree, True)
            print('end of sub tree creation')
        elif type(subtree) == str:
            surface_form = subtree.split('_')[0]
            print('son creation:', surface_form)

traverse_tree(parse)

[out]:

<class 'str'> When_WRB_advmod
son creation: When
<class 'str'> did_VBD_aux
son creation: did
<class 'str'> Beyonce_NNP_nsubj
son creation: Beyonce
<class 'nltk.tree.Tree'> (becoming_VBG_xcomp popular_JJ_acomp)
sub tree creation
<class 'str'> popular_JJ_acomp
son creation: popular
end of sub tree creation
<class 'str'> ?_._punct
son creation: ?
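
The traversal above only prints the leaves, so the subtree labels (start, becoming) never appear. One possible variation, sketched here under the same assumption that every node looks like word_POS_dep, is to build a new Tree with the tags stripped from both labels and leaves instead of printing. The function name `strip_tags` is made up for this illustration and is not an nltk API.

```python
from nltk import Tree

def strip_tags(node):
    # Hypothetical helper: recursively rebuild the tree, keeping only the
    # text before the first underscore in every label and every leaf.
    if isinstance(node, Tree):
        return Tree(node.label().split('_')[0],
                    [strip_tags(child) for child in node])
    return node.split('_')[0]

parse = Tree('start_VB_ROOT',
             ['When_WRB_advmod', 'did_VBD_aux', 'Beyonce_NNP_nsubj',
              Tree('becoming_VBG_xcomp', ['popular_JJ_acomp']),
              '?_._punct'])

clean = strip_tags(parse)

print(clean.label())     # start
print(clean[3].label())  # becoming
print(clean.leaves())    # ['When', 'did', 'Beyonce', 'popular', '?']
```

Because the result is an ordinary Tree, the root word stays attached as the parent of all the other nodes, and two stripped trees can be compared with `==`.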

Great ! Yet, how to show that `start` is included as the parent of all ? — Revolucion for Monica, Aug 30 '18 at 15:43
Try it for yourself. It shouldn't be hard to know where to print the "start" of a Tree. (Hint: It's near `if type(subtree) == Tree` and `Tree.label()` is useful). I believe in you ;P — alvas, Aug 30 '18 at 21:30
