Stanford Parser multithread usage

Question

Stanford Parser is now 'thread-safe' as of version 2.0 (02.03.2012). I am currently running the command line tools and cannot figure out how to make use of my multiple cores by threading the program.

In the past, this question has been answered with "Stanford Parser is not thread-safe", as the FAQ still says. I am hoping to find someone who has had success threading the latest version.

I have tried the -t flag (-t10 and -tLLP), since that was all I could find in my searches, but both throw errors.

An example of a command I issue is:

java -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser \
-outputFormat "oneline" ./grammar/englishPCFG.ser.gz ./corpus > corpus.lex

Christopher Manning · Accepted Answer · 2013-11-03T16:48:04.403

Starting with version 2.0.5, you can now easily use multiple threads with the option -nthreads k. For example, your command can be like this:

java -mx6g edu.stanford.nlp.parser.lexparser.LexicalizedParser -nthreads 4 edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz file.txt > file.stp

(Releases of version 2 prior to 2013 had no way to enable multithreading from the command line; it was available only through the API.)

Internally, you can simultaneously run as many parsing threads inside one JVM process as you want. You can do this either by getting and using multiple LexicalizedParserQuery objects (via the parserQuery() method) or implicitly by calling apply(...) or parseTree(...) on a single LexicalizedParser. The -nthreads k option does this for you by sending successive sentences to different parsers using the Executor framework. You can also create multiple LexicalizedParser instances at the same time, e.g., for parsing different languages.
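A minimal sketch of that dispatch pattern, using only the JDK. This is not the parser's own implementation of -nthreads: the class name, `parseAll`, and the `parseOne` function are hypothetical, and `parseOne` stands in for a thread-safe call such as parseTree(...) or apply(...) on a shared LexicalizedParser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelParseSketch {

    // Parses each sentence on a fixed-size thread pool and returns the
    // results in input order. parseOne is a placeholder (assumption) for a
    // thread-safe call into the parser.
    static List<String> parseAll(List<String> sentences,
                                 Function<String, String> parseOne,
                                 int nThreads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String sentence : sentences) {
                // One task per sentence; the pool runs it on whichever thread is free.
                Callable<String> task = () -> parseOne.apply(sentence);
                pending.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                // Collecting in submission order keeps output aligned with input.
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in "parser" that only wraps its input in brackets.
        Function<String, String> fakeParser = s -> "(ROOT " + s + ")";
        List<String> trees =
                parseAll(Arrays.asList("The cat sat .", "Dogs bark ."), fakeParser, 4);
        trees.forEach(System.out::println);
    }
}
```

Collecting the futures in submission order means the output lines up with the input sentences even when the tasks finish out of order.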

Multiple LexicalizedParserQuery objects share the same grammar (LexicalizedParser), but the memory savings aren't huge, as most of the memory goes to the transient structures used in chart parsing. So, if you are running lots of parsing threads concurrently, you will need to give a lot of memory to the JVM, as in the example above.
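The split between one shared grammar and per-thread transient state might be pictured roughly as follows. Everything here is a stand-in: `Grammar` and `Query` are hypothetical classes playing the roles of LexicalizedParser and LexicalizedParserQuery, and the `StringBuilder` merely represents the per-query chart that accounts for most of the memory.

```java
class SharedGrammarSketch {

    // Hypothetical stand-in for the loaded grammar: read-only once built,
    // so a single instance can be shared by every thread.
    static final class Grammar {
        final String rootLabel;
        Grammar(String rootLabel) { this.rootLabel = rootLabel; }
    }

    // Hypothetical stand-in for a query object: it owns mutable scratch
    // space, so each thread needs its own instance.
    static final class Query {
        private final Grammar grammar;
        private final StringBuilder chart = new StringBuilder();

        Query(Grammar grammar) { this.grammar = grammar; }

        String parse(String sentence) {
            chart.setLength(0); // reuse this thread's scratch space
            chart.append('(').append(grammar.rootLabel)
                 .append(' ').append(sentence).append(')');
            return chart.toString();
        }
    }

    // One grammar for the whole JVM.
    static final Grammar GRAMMAR = new Grammar("ROOT");

    // One Query per thread, created lazily on that thread's first use.
    static final ThreadLocal<Query> QUERY =
            ThreadLocal.withInitial(() -> new Query(GRAMMAR));

    static String parse(String sentence) {
        return QUERY.get().parse(sentence);
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(() -> System.out.println(parse("The cat sat .")));
        Thread b = new Thread(() -> System.out.println(parse("Dogs bark .")));
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```

In this picture, memory grows with the number of live `Query` objects rather than with the number of grammars, which is why the heap needs to scale with the thread count.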

p.s. Sorry, yes, some of the documentation still needs updating. But -tLPP is one flag for specifying language-specific resources. The Stanford Parser has no -t flag.

edited Nov 03 '13 at 16:48

answered Feb 15 '12 at 14:35


Hello, I want to program with the API instead of using the command line. Do you mean there is no need to split the corpus manually, and LexicalizedParser will take care of the splitting and combining work? So the multithreading is transparent to the programmer? – Matt Jun 19 '12 at 22:17
It's not transparent. It means that you can call LexicalizedParser's parseTree() or apply() methods on different sentences simultaneously and it will work correctly, which was not the case before version 2.0. How you do things after that is up to you, but the obvious modern Java way would be not to split the corpus but to set up an Executor service and have a bunch of parsing tasks running simultaneously. – Christopher Manning Jun 20 '12 at 15:32
Thanks, I was watching your NLP online course. That helps a lot too! Respect. – Matt Jun 20 '12 at 18:14
Has any work been done on this since? If not I may be interested in helping the effort to improve performance for command-line users. – Preston Lee Oct 31 '13 at 18:15
Yes. The answer changed in the 2.0.5 release. I'll update the main answer. – Christopher Manning Nov 02 '13 at 23:28
