
The website of Stanford CoreNLP

http://nlp.stanford.edu/software/corenlp.shtml

lists dozens of Annotators which work like a charm. I would like to use instances of the Annotators for the common tasks (lemmatization, tagging, parsing) from multiple threads, for example to split the processing of a massive amount of text (GBs) across threads, or to provide web services.

There has been some discussion in the past referring to ThreadLocals which, by my understanding, would use one instance of an Annotator per thread (thus avoiding problems regarding thread-safety). This is an option, but that way all model files and resources have to be loaded n times as well.
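Something like the following sketch is what I have in mind (the annotator list here is just illustrative):

```java
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class PerThreadPipelines {
  // One pipeline per thread: no sharing between threads,
  // but the models are loaded once for every thread.
  private static final ThreadLocal<StanfordCoreNLP> PIPELINE =
      ThreadLocal.withInitial(() -> {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        return new StanfordCoreNLP(props); // expensive: repeated in every thread
      });
}
```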

Are the Annotators (or some of them) thread-safe to use? I couldn't find anything conclusive/official in the discussions, docs, or FAQs.

asked by Rüdiger (edited by Matt Ball)
1 Answer


Yes, the annotators are intended to be thread-safe. You can create a new AnnotationPipeline (e.g., a new StanfordCoreNLP object), and then many threads can pass annotations into this pipeline without reloading the model for each thread.
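For example, a minimal sketch of that pattern (the annotator list, thread count, and sample documents are only illustrative):

```java
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SharedPipeline {
  public static void main(String[] args) {
    // Load the models once, in a single pipeline instance.
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // Many threads share that one pipeline; each annotate() call runs on the calling thread.
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (String doc : new String[]{"First document.", "Second document.", "Third document."}) {
      pool.submit(() -> {
        Annotation annotation = new Annotation(doc);
        pipeline.annotate(annotation);
        // ... read sentences, tokens, lemmas, etc. from `annotation` here ...
      });
    }
    pool.shutdown();
  }
}
```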

answered by Gabor Angeli
  • If we use multiple AnnotationPipelines, will the engine process the requests concurrently, or will it queue internally? Thanks! – Carol AndorMarten Liebster Oct 23 '15 at 13:38
  • There's no notion of an engine independent of the annotation pipeline. A call to `AnnotationPipeline.annotate()` will use the current thread to run the annotation. You can, however, have multiple threads call `.annotate()` on the same AnnotationPipeline. – Gabor Angeli Oct 24 '15 at 01:26
  • Thanks for the clarification - (unless I'm misunderstanding the internals of the library, which is likely ;) ), will the AnnotationPipeline process the 2 .annotate() requests concurrently? Or will the 2nd call to .annotate() be blocked until the first is completed? – Carol AndorMarten Liebster Oct 26 '15 at 13:10
  • 2
    No problem! And yes, the pipeline will process the two requests concurrently. – Gabor Angeli Oct 26 '15 at 16:12
  • Hi there, though a few months have passed, I have hit a concurrency issue while using multiple threads to feed different Annotations to a single pipeline. Basically each of my threads instantiates an Annotation object from a string document, and all of the threads use the same pipeline (a StanfordCoreNLP object) to annotate it. I use 8 threads. I almost always get a ConcurrentModificationException when calling FutureTask.get() to retrieve the Annotation result. – Xing Hu Jan 06 '16 at 10:03
  • Somewhat counterintuitively, a ConcurrentModificationException is usually not tied to concurrency bugs; rather, it happens when you change a collection while holding an open iterator on it. – Gabor Angeli Jan 06 '16 at 15:07
  • Let me create another question so I can explain this problem clearly. – Xing Hu Jan 06 '16 at 18:45
  • @GaborAngeli Before writing that post, I wanted to run my program several more times to make sure the problem really happens and is easy to reproduce. It is much faster now that multi-threading is applied, but it still takes time, so I removed the currently unused annotators "parse" and "dcoref". Now the ConcurrentModificationException is gone. So far I have run it more than 10 times and the exception has never been thrown; with the "parse" and "dcoref" annotators it was thrown almost every time. – Xing Hu Jan 06 '16 at 19:04
  • @GaborAngeli, is `NERClassifierCombiner` also thread-safe, and can it be shared across multiple threads? – Mike Apr 09 '16 at 09:59
  • The annotators certainly are if run in the pipeline. Not sure about the classes themselves. – Gabor Angeli Feb 10 '17 at 16:02