
I'm using the Stanford Parser from the command line:

java -mx1500m -cp stanford-parser.jar;stanford-parser-models.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn"  edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz {file}

When I run the command on a single sentence with 27 words, the Java process consumes 100 MB of memory and parsing takes 1.5 seconds. When I run it on a single sentence with 148 words, the Java process consumes 1.5 GB of memory and parsing takes 1.5 minutes.

The machine I'm using runs Windows 7 with an Intel i5 at 2.53 GHz.

Are these processing times reasonable? Is there any official performance benchmark for the parser?

mbatchkarov
User-10000000
    (1) Given the 1.5 seconds for 27 words, it sounds reasonable. And you would expect the possible parses to increase a lot more with more words per sentence. Question is why are you parsing a sentence with 148 words? It's possibly not a very natural sentence for natural language processing. (2) No one likes to benchmark NLP tools, it's bulky and wouldn't be realtime unless you do some distributed computing trick. – alvas Jun 13 '13 at 13:09
  • agreed with @2er0: 148 is probably a too long sentence. Can you give us the sentence? – Renaud Jun 13 '13 at 14:37
  • Thanks, @2er0 and Renaud. I think you answered my question. I just wanted to verify that 1.5 seconds for a 27-word sentence is reasonable and that I'm not doing something completely wrong. I agree that a 148-word sentence is not reasonable. The reason I'm parsing such long sentences is that I have a system that can receive any input. When the sentences are not punctuated with a dot at the end, the NLP engine cannot split them correctly - this is how I sometimes get such long sentences. – User-10000000 Jun 14 '13 at 07:44
  • The sentence is too long so I need to split it into 2 comments: "Something gotta change It must be rearranged I'm sorry, I did not mean to hurt my little girl It's beyond me I cannot carry the weight of the heavy world So good night, good night, good night, good night Good night, good night, good night, good night, good night Hope that things work out all right So much to love, so much to learn But I won't be there to teach you Oh, I know I can be close But I try my best to reach you I'm so sorry I didn't not mean to hurt my little girl It's beyond me – User-10000000 Jun 14 '13 at 07:47
  • I cannot carry the weight of the heavy world So good night, good night, good night, good night Good night, good night, good night, good night Good night, good night, good night good night, good night Hope that things work out all right, yeah Thank you." – User-10000000 Jun 14 '13 at 07:48
  • You have a sentence segmentation problem before your parsing. – alvas Jun 14 '13 at 10:32
  • I agree with all comments and I would add that there is absolutely no chance that any parser parses correctly a 148 words sentence. So you have nothing to lose by segmenting into smaller sentences. – Blacksad Jun 14 '13 at 15:51
  • As everyone says, you need to sentence segment, but if part of the goal is just to never have it take forever on long sentences, you can set a limit to the length that will be parsed, e.g., `-maxLength 50`. Longer sentences will be ignored (command-line) or given a flat parse. – Christopher Manning Sep 22 '13 at 20:52
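For reference, the `-maxLength 50` flag suggested in the last comment would slot into the question's original command like this (jar names, model path, and the `input.txt` file name are assumed to match the asker's setup):

```shell
java -mx1500m -cp "stanford-parser.jar;stanford-parser-models.jar" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn" -maxLength 50 edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz input.txt
```

With this limit in place, sentences longer than 50 words are skipped on the command line instead of consuming minutes of CPU time.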

1 Answer


As the comments note, your problem lies in sentence segmentation, since your system accepts any input (with or without proper punctuation). Fortunately your data retains capitalization, so you can try the recipe below to segment sentences by capitalization.

Disclaimer: if a sentence starts with "I", the recipe below isn't going to help much =)

"Something gotta change It must be rearranged I'm sorry, I did not mean to hurt my little girl It's beyond me I cannot carry the weight of the heavy world So good night, good night, good night, good night Good night, good night, good night, good night, good night Hope that things work out all right So much to love, so much to learn But I won't be there to teach you Oh, I know I can be close But I try my best to reach you I'm so sorry I didn't not mean to hurt my little girl It's beyond me I cannot carry the weight of the heavy world So good night, good night, good night, good night Good night, good night, good night, good night Good night, good night, good night good night, good night Hope that things work out all right, yeah Thank you."

In Python, you can try this to segment the sentence:

sentence = "Something gotta change It must be rearranged I'm sorry, I did not mean to hurt my little girl It's beyond me I cannot carry the weight of the heavy world So good night, good night, good night, good night Good night, good night, good night, good night, good night Hope that things work out all right So much to love, so much to learn But I won't be there to teach you Oh, I know I can be close But I try my best to reach you I'm so sorry I didn't not mean to hurt my little girl It's beyond me I cannot carry the weight of the heavy world So good night, good night, good night, good night Good night, good night, good night, good night Good night, good night, good night good night, good night Hope that things work out all right, yeah Thank you."

temp = []
sentences = []
for word in sentence.split():
    # Start a new sentence at each capitalized word, except the pronoun "I"
    if word[0].isupper() and word != "I":
        if temp:  # avoid emitting an empty string before the first word
            sentences.append(" ".join(temp))
        temp = [word]
    else:
        temp.append(word)
sentences.append(" ".join(temp))
print(sentences)
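The same rule (break before any capitalized token except a standalone "I") can also be written as a single regex split. The lookahead pattern here is my own sketch, not part of the original recipe:

```python
import re

text = "Something gotta change It must be rearranged I'm sorry, I did not mean"

# Split at a space followed by a capitalized word, unless that
# word is exactly the standalone pronoun "I" (i.e. "I" + space)
parts = re.split(r" (?=(?!I )[A-Z])", text)
print(parts)
# ['Something gotta change', 'It must be rearranged', "I'm sorry, I did not mean"]
```

Like the loop above, this still breaks before "It" and "I'm" but not before a bare "I", so it shares the same disclaimer.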

Then, to parse the segmented sentences, follow Stanford Parser and NLTK.

alvas