
The Penn Treebank format does not annotate the internal structure of a noun phrase, e.g.

(NP (JJ crude) (NN oil) (NNS prices))

or

(NP
    (NP (DT the) (JJ big) (JJ blue) (NN house))
    (SBAR
      (WHNP (WDT that))
      (S
        (VP (VBD was)
          (VP (VBN built)
            (PP (IN near)
              (NP (DT the) (NN river))))))))

I would like to extract the heads (prices and house). Do you know of any tool that can do this?

John Manak

3 Answers


Michael Collins's dissertation (Appendix A) includes head-finding rules for the Penn Treebank that work reasonably well and are not difficult to implement. They are far from perfect, though, since head finding is not an easy task.
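
A minimal Python sketch of how such rules can be applied to noun phrases. The rule table is a simplified paraphrase of the NP rules in Appendix A and should be checked against the dissertation before you rely on it; the names `NP_RULES`, `parse`, `np_head_child` and `head_word` are made up for this example, and only NP nodes are handled.

```python
import re

# Simplified NP head rules, paraphrased from Collins's Appendix A.
# ASSUMPTION: the label sets and search directions are reproduced from
# memory and should be verified against the dissertation.
NP_RULES = [
    ("right-to-left", {"NN", "NNP", "NNPS", "NNS", "NX", "POS", "JJR"}),
    ("left-to-right", {"NP"}),
    ("right-to-left", {"$", "ADJP", "PRN"}),
    ("right-to-left", {"CD"}),
    ("right-to-left", {"JJ", "JJS", "RB", "QP"}),
]


def parse(text):
    """Parse a bracketed tree into nested (label, children) tuples.

    A preterminal such as (NN oil) becomes ("NN", ["oil"]).
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()


def is_preterminal(tree):
    """True for a (tag, [word]) node, i.e. all children are strings."""
    return all(isinstance(child, str) for child in tree[1])


def np_head_child(children):
    """Pick the head among the immediate children of one NP node."""
    if children[-1][0] == "POS":      # possessive ending, not "part of speech"
        return children[-1]
    for direction, candidates in NP_RULES:
        order = reversed(children) if direction == "right-to-left" else children
        for child in order:
            if child[0] in candidates:
                return child
    return children[-1]               # fallback: last child


def head_word(tree):
    """Follow head children down to a word. Only NP nodes are supported."""
    while not is_preterminal(tree):
        label, children = tree
        if label.split("-")[0] != "NP":
            raise NotImplementedError(f"no head rules sketched for {label}")
        tree = np_head_child(children)
    return tree[1][0]


print(head_word(parse("(NP (JJ crude) (NN oil) (NNS prices))")))  # prices
print(head_word(parse(
    "(NP (NP (DT the) (JJ big) (JJ blue) (NN house))"
    " (SBAR (WHNP (WDT that)) (S (VP (VBD was) (VP (VBN built)"
    " (PP (IN near) (NP (DT the) (NN river))))))))"
)))  # house
```

Each call to `np_head_child` looks only at the immediate children of one node, so the relative clause in the second example never competes with "house" for the head.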

The work by David Vadas and James Curran on NP structure in the Penn Treebank could also be relevant.

gdrt
aab
  • Thanks. Collins's first rule says: If the last word is tagged POS, return (last-word). I don't understand this one properly. For instance, in the second example in my question, the last word is "river" and it is tagged (NN). Does it mean it would pick this one as the head? That's clearly wrong. – John Manak Apr 25 '12 at 03:44
  • If the last word is tagged with the tag POS (possessive ending), this word would be marked as the head. For the NP "the river", "river" would be returned as the head in the right-to-left search for NN, NNP, etc. Why is "river" as the head of "the river" wrong? – aab Apr 25 '12 at 09:23
  • Ah, I see what you're asking now. The rules only look at a single context-free rule at a time, i.e. the immediate children of a given NP node, not further down in the tree. So, for the rule NP -> NP SBAR, which is the top-level NP, it would look right-to-left and mark as the head NP (NP -> NP' SBAR). Within "the big blue house", the whole process would be repeated to mark "house" as the head (NP -> DT JJ JJ NN'). – aab Apr 25 '12 at 12:39
  • Thanks for the clarification. I got confused by "POS". I initially thought Collins meant "part of speech", not "possessive". – John Manak Apr 26 '12 at 00:01
  • Do you know what Collins means by $? There's no '$' tag in Penn Treebank. – John Manak Jun 07 '12 at 07:39
  • Oh, that'd probably be for dollar amounts, e.g. $127. – John Manak Jun 07 '12 at 07:41

As aab suggested, simple deterministic head-finding rules can work quite well (also see references to Magerman or Charniak head-finding rules for similar approaches).

You might also look at extracting dependency structure from the constituent trees. The Stanford tools do this quite well; see http://nlp.stanford.edu/software/stanford-dependencies.shtml

AaronD

You can also find head-finding rules for English in Dan Bikel's thesis. If you need source code, it is available with the parser software on his homepage.

Hoai-Thu Vuong