Finding head of a noun phrase in NLTK and stanford parse according to the rules of finding head of a NP

Question

generally A head of a nounphrase is a noun which is rightmost of the NP as shown below tree is the head of the parent NP. So

            ROOT                             
             |                                
             S                               
          ___|________________________        
         NP                           |      
      ___|_____________               |       
     |                 PP             VP     
     |             ____|____      ____|___    
     NP           |         NP   |       PRT 
  ___|_______     |         |    |        |   
 DT  JJ  NN  NN   IN       NNP  VBD       RP 
 |   |   |   |    |         |    |        |   
The old oak tree from     India fell     down

Out[40]: Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['The']), Tree('JJ', ['old']), Tree('NN', ['oak']), Tree('NN', ['tree'])]), Tree('PP', [Tree('IN', ['from']), Tree('NP', [Tree('NNP', ['India'])])])]), Tree('VP', [Tree('VBD', ['fell']), Tree('PRT', [Tree('RP', ['down'])])])])

The following code based on a java implementation uses a simplistic rule to find the head of the NP , but i need to be based on the rules:

parsestr='(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'
def traverse(t):
    try:
        t.label()
    except AttributeError:
          return
    else:
        if t.label()=='NP':
            print 'NP:'+str(t.leaves())
            print 'NPhead:'+str(t.leaves()[-1])
            for child in t:
                 traverse(child)

        else:
            for child in t:
                traverse(child)


tree=Tree.fromstring(parsestr)
traverse(tree)

The above code gives output:

NP:['The', 'old', 'oak', 'tree', 'from', 'India'] NPhead:India NP:['The', 'old', 'oak', 'tree'] NPhead:tree NP:['India'] NPhead:India

Although now its giving correct output for the sentence given but I need to incorporate a condition that only right most noun is extracted as head , currently it does not check if it were a noun (NN)

print 'NPhead:'+str(t.leaves()[-1])

So something like following in the np head condition in above code:

t.leaves().getrightmostnoun()

Michael Collins dissertation (Appendix A) includes head-finding rules for the Penn Treebank, and hence it is not necessary that only the rightmost noun is the head. Hence the above conditions should incorporate such scenario.

For the following example as given in one of the answers:

(NP (NP the person) that gave (NP the talk)) went home

The head noun of the subject is person but the last leave node of the NP the person that gave the talk is talk.

Please read the help page http://stackoverflow.com/help/mcve. In this case, show the output you *do* get: "doesn't work" is not sufficient for StackOverflow. Also, please try adding more print statements to your code (such as one just before you traverse(child), and another on entry to traverse). Post the output of that execution trace -- provided it doesn't immediately show *you* the problem. — Prune, Sep 19 '15 at 00:28

alvas · Answer 1 · 2015-09-19T09:55:36.163

8

There are built-in string to Tree object in NLTK (http://www.nltk.org/_modules/nltk/tree.html), see https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L541.

>>> from nltk.tree import Tree
>>> parsestr='(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'
>>> for i in Tree.fromstring(parsestr).subtrees():
...     if i.label() == 'NP':
...             print i
... 
(NP
  (NP (DT The) (JJ old) (NN oak) (NN tree))
  (PP (IN from) (NP (NNP India))))
(NP (DT The) (JJ old) (NN oak) (NN tree))
(NP (NNP India))


>>> for i in Tree.fromstring(parsestr).subtrees():
...     if i.label() == 'NP':
...             print i.leaves()
... 
['The', 'old', 'oak', 'tree', 'from', 'India']
['The', 'old', 'oak', 'tree']
['India']

Note that it's not always the case that right most noun is the head noun of an NP, e.g.

>>> s = '(ROOT (S (NP (NN Carnac) (DT the) (NN Magnificent)) (VP (VBD gave) (NP ((DT a) (NN talk))))))'
>>> Tree.fromstring(s)
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Carnac']), Tree('DT', ['the']), Tree('NN', ['Magnificent'])]), Tree('VP', [Tree('VBD', ['gave']), Tree('NP', [Tree('', [Tree('DT', ['a']), Tree('NN', ['talk'])])])])])])
>>> for i in Tree.fromstring(s).subtrees():
...     if i.label() == 'NP':
...             print i.leaves()[-1]
... 
Magnificent
talk

Arguably, Magnificent can still be the head noun. Another example is when the NP includes a relative clause:

(NP (NP the person) that gave (NP the talk)) went home

The head noun of the subject is person but the last leave node of the NP the person that gave the talk is talk.

edited Sep 19 '15 at 09:55

answered Sep 19 '15 at 09:01

alvas

115,346
109
446
738

i have done it finally as shown in the code but just need to add a condition to check if rightmost is a NN – stackit Sep 19 '15 at 09:04
So something like following in the np head condition in above code: t.leaves().getrightmostnoun() – stackit Sep 19 '15 at 09:07
Note that it's not always the case that right most noun is the head noun of an NP! – alvas Sep 19 '15 at 09:35
yes , thats the reason I asked this question ! i need to incorporate all the rules \possibilities, being rightmost is one of them – stackit Sep 19 '15 at 09:37
2

Michael Collins dissertation (Appendix A) includes head-finding rules for the Penn Treebank, and hence it is not necessary that only the rightmost noun is the head3 – stackit Sep 19 '15 at 09:50
If you want, you can easily implement Michael Collin's rules and then push it back to NLTK =) – alvas Sep 19 '15 at 10:02
he writes rules like else search from left to right ... then what? – stackit Sep 19 '15 at 10:03
2

Ask politely on the NLTK github issue for someone help implement it if you're having troubles. Better yet, try implementing, do a pull request with your working code and ask for a code review and i'm sure NLTK dev will help you out with it. Or wait until someone else code it =) – alvas Sep 19 '15 at 10:06
even the collins rules seem to fail in the test sentence given by you – stackit Sep 19 '15 at 10:47
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/90143/discussion-between-stackit-and-alvas). – stackit Sep 20 '15 at 05:56

score 1 · Answer 2 · answered May 18 '16 at 10:25

I was looking for a python script using NLTK that does this task and stumbled across this post. Here's the solution I came up with. It's a little bit noisy and arbitrary, and definitely doesn't always pick the right answer (e.g. for compound nouns). But I wanted to post it in case it was helpful for others to have a solution that mostly works.

#!/usr/bin/env python

from nltk.tree import Tree

examples = [
    '(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))',
    "(ROOT\n  (S\n    (NP\n      (NP (DT the) (NN person))\n      (SBAR\n        (WHNP (WDT that))\n        (S\n          (VP (VBD gave)\n            (NP (DT the) (NN talk))))))\n    (VP (VBD went)\n      (NP (NN home)))))",
    '(ROOT (S (NP (NN Carnac) (DT the) (NN Magnificent)) (VP (VBD gave) (NP ((DT a) (NN talk))))))'
]

def find_noun_phrases(tree):
    return [subtree for subtree in tree.subtrees(lambda t: t.label()=='NP')]

def find_head_of_np(np):
    noun_tags = ['NN', 'NNS', 'NNP', 'NNPS']
    top_level_trees = [np[i] for i in range(len(np)) if type(np[i]) is Tree]
    ## search for a top-level noun
    top_level_nouns = [t for t in top_level_trees if t.label() in noun_tags]
    if len(top_level_nouns) > 0:
        ## if you find some, pick the rightmost one, just 'cause
        return top_level_nouns[-1][0]
    else:
        ## search for a top-level np
        top_level_nps = [t for t in top_level_trees if t.label()=='NP']
        if len(top_level_nps) > 0:
            ## if you find some, pick the head of the rightmost one, just 'cause
            return find_head_of_np(top_level_nps[-1])
        else:
            ## search for any noun
            nouns = [p[0] for p in np.pos() if p[1] in noun_tags]
            if len(nouns) > 0:
                ## if you find some, pick the rightmost one, just 'cause
                return nouns[-1]
            else:
                ## return the rightmost word, just 'cause
                return np.leaves()[-1]

for example in examples:
    tree = Tree.fromstring(example)
    for np in find_noun_phrases(tree):
        print "noun phrase:",
        print " ".join(np.leaves())
        head = find_head_of_np(np)
        print "head:",
        print head

For the examples discussed in the question and in the other answers, this is the output:

noun phrase: The old oak tree from India
head: tree
noun phrase: The old oak tree
head: tree
noun phrase: India
head: India
noun phrase: the person that gave the talk
head: person
noun phrase: the person
head: person
noun phrase: the talk
head: talk
noun phrase: home
head: home
noun phrase: Carnac the Magnificent
head: Magnificent
noun phrase: a talk
head: talk

Finding head of a noun phrase in NLTK and stanford parse according to the rules of finding head of a NP

2 Answers2

Linked