
I am using the Stanford Parser to parse Chinese text, and I want to extract the context-free grammar production rules from the parses.

I have set up my environment for the Stanford Parser and NLTK.

My code is below:

from nltk.parse import stanford
parser = stanford.StanfordParser(path_to_jar='/home/stanford-parser-full-2013-11-12/stanford-parser.jar', 
                                 path_to_models_jar='/home/stanford-parser-full-2013-11-12/stanford-parser-3.3.0-models.jar',
                                 model_path='/home/stanford-parser-full-2013-11-12/chinesePCFG.ser.gz', encoding='utf8')

text = '我 对 这个 游戏 有 一 点 上瘾。'
sentences = parser.raw_parse_sents(unicode(text, encoding='utf8'))

However, when I try to

print sentences

I get

[Tree('ROOT', [Tree('IP', [Tree('NP', [Tree('PN', ['\u6211'])])])]), Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('VA', ['\u5bf9'])])])]), Tree('ROOT', [Tree('IP', [Tree('NP', [Tree('PN', ['\u8fd9'])])])]), Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('QP', [Tree('CLP', [Tree('M', ['\u4e2a'])])])])])]), Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('VV', ['\u6e38'])])])]), Tree('ROOT', [Tree('FRAG', [Tree('NP', [Tree('NN', ['\u620f'])])])]), Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('VE', ['\u6709'])])])]), Tree('ROOT', [Tree('FRAG', [Tree('QP', [Tree('CD', ['\u4e00'])])])]), Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('VV', ['\u70b9'])])])]), Tree('ROOT', [Tree('IP', [Tree('VP', [Tree('VV', ['\u4e0a'])])])]), Tree('ROOT', [Tree('FRAG', [Tree('NP', [Tree('NN', ['\u763e'])])])]), Tree('ROOT', [Tree('IP', [Tree('NP', [Tree('PU', ['\u3002'])])])])]

in which each Chinese character appears as a separate tree. There should be 9 subtrees (one per word), but 12 subtrees are returned. Could anyone show me what the problem is?

Next, I try to collect all the context-free grammar production rules from the result:

lst = []
for subtree in sentences:
    for production in subtree.productions():
        lst.append(production)
print lst

[ROOT -> IP, IP -> NP, NP -> PN, PN -> '\u6211', ROOT -> IP, IP -> VP, VP -> VA, VA -> '\u5bf9', ROOT -> IP, IP -> NP, NP -> PN, PN -> '\u8fd9', ROOT -> IP, IP -> VP, VP -> QP, QP -> CLP, CLP -> M, M -> '\u4e2a', ROOT -> IP, IP -> VP, VP -> VV, VV -> '\u6e38', ROOT -> FRAG, FRAG -> NP, NP -> NN, NN -> '\u620f', ROOT -> IP, IP -> VP, VP -> VE, VE -> '\u6709', ROOT -> FRAG, FRAG -> QP, QP -> CD, CD -> '\u4e00', ROOT -> IP, IP -> VP, VP -> VV, VV -> '\u70b9', ROOT -> IP, IP -> VP, VP -> VV, VV -> '\u4e0a', ROOT -> FRAG, FRAG -> NP, NP -> NN, NN -> '\u763e', ROOT -> IP, IP -> NP, NP -> PU, PU -> '\u3002'] 

But the Chinese words are still split into individual characters.
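For reference, the `productions()` loop itself behaves as expected once the tree has whole-word leaves. Here is a small self-contained check with `nltk.Tree` (the bracketing below is one I made up by hand, not actual parser output):

```python
# -*- coding: utf-8 -*-
from nltk import Tree

# A hand-built tree with whole words as leaves (illustrative only,
# not real Stanford Parser output):
t = Tree.fromstring(u"(ROOT (IP (NP (PN 我)) (VP (VE 有) (NP (NN 游戏)))))")

# Collect the CFG production rules exactly as in the loop above
lst = []
for production in t.productions():
    lst.append(production)
# lst now contains rules like ROOT -> IP, IP -> NP VP, NP -> PN,
# and lexical rules such as PN -> '我' with the whole word intact.
```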

Since I do not know much Java, I have to use the Python interface for this task. Could anyone help me with it?

allenwang

1 Answer


I have found the solution: using parser.raw_parse instead of parser.raw_parse_sents solves the problem, because parser.raw_parse_sents expects a list of sentences. When you pass it a single string, it treats each character as a separate sentence.
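To illustrate (the parser setup is the same as in the question; the `parser` calls below are shown as comments since they need the Stanford jars installed):

```python
# -*- coding: utf-8 -*-
text = u'我 对 这个 游戏 有 一 点 上瘾。'

# raw_parse_sents expects a list of sentences. A plain string is also
# iterable, so it gets walked character by character, and each non-space
# character is parsed as its own "sentence" -- which is exactly where
# the 12 trees in the question come from:
chars = [c for c in text if c != u' ']
assert len(chars) == 12  # one tree per non-space character

# The fix: either pass the single sentence to raw_parse ...
#   sentences = parser.raw_parse(text)
# ... or keep raw_parse_sents but wrap the sentence in a list:
#   sentences = parser.raw_parse_sents([text])
```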

maple
  • Hey bro, I tried your solution, but it did not work at all; I got the same result as the one in this post. Could you please show your source code and result concretely? – allenwang Dec 13 '15 at 10:20
  • I really need your help. – allenwang Dec 13 '15 at 10:20
  • I use stanford-parser-full-2014-10-31. There are some problems with the recent version, and I think you may be using it. – maple Dec 14 '15 at 05:45
  • I have already tried the newest version, but it doesn't work. Could you please show me some details? I mean, could you please display your code and the output? – allenwang Dec 14 '15 at 05:50