
I'm facing a concurrency problem when annotating multiple sentences simultaneously. It's unclear to me whether I'm doing something wrong or whether there is a bug in CoreNLP.

My goal is to annotate sentences with the pipeline "tokenize, ssplit, pos, lemma, ner, parse, dcoref" using several threads running in parallel. Each thread allocates its own instance of StanfordCoreNLP and then uses it for the annotation.

The problem is that at some point an exception is thrown:

java.util.ConcurrentModificationException
 at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
 at java.util.ArrayList$Itr.next(ArrayList.java:851)
 at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1042)
 at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:463)
 at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
 at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
 at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
 at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
 at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
 at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
 at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
 at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
 at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
 at edu.stanford.nlp.trees.GrammaticalStructure.analyzeNode(GrammaticalStructure.java:488)
 at edu.stanford.nlp.trees.GrammaticalStructure.<init>(GrammaticalStructure.java:201)
 at edu.stanford.nlp.trees.EnglishGrammaticalStructure.<init>(EnglishGrammaticalStructure.java:89)
 at edu.stanford.nlp.semgraph.SemanticGraphFactory.makeFromTree(SemanticGraphFactory.java:139)
 at edu.stanford.nlp.pipeline.DeterministicCorefAnnotator.annotate(DeterministicCorefAnnotator.java:89)
 at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:68)
 at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:412)

I'm attaching sample code for an application that reproduces the problem in about 20 seconds on my Core i3 370M laptop (Win 7 64-bit, Java 1.8.0_45 64-bit). This app reads an XML file from the Recognizing Textual Entailment (RTE) corpora and then parses all sentences simultaneously using standard Java concurrency classes. The path to a local RTE XML file must be given as a command-line argument. In my tests I used the publicly available XML file here: http://www.nist.gov/tac/data/RTE/RTE3-DEV-FINAL.tar.gz

package semante.parser.stanford.server;

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlAttribute;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class StanfordMultiThreadingTest {

 @XmlRootElement(name = "entailment-corpus")
 @XmlAccessorType (XmlAccessType.FIELD)
 public static class Corpus {
  @XmlElement(name = "pair")
  private List<Pair> pairList = new ArrayList<Pair>();

  public void addPair(Pair p) {pairList.add(p);}
  public List<Pair> getPairList() {return pairList;}
 }

 @XmlRootElement(name="pair")
 public static class Pair {

  @XmlAttribute(name = "id")
  String id;

  @XmlAttribute(name = "entailment")
  String entailment;

  @XmlElement(name = "t")
  String t;

  @XmlElement(name = "h")
  String h;

  private Pair() {}

  public Pair(int id, boolean entailment, String t, String h) {
   this();
   this.id = Integer.toString(id);
   this.entailment = entailment ? "YES" : "NO";
   this.t = t;
   this.h = h;
  }

  public String getId() {return id;}
  public String getEntailment() {return entailment;}
  public String getT() {return t;}
  public String getH() {return h;}
 }
 
 class NullStream extends OutputStream {
  @Override
  public void write(int b) {}
 }

 private Corpus corpus;
 private Unmarshaller unmarshaller;
 private ExecutorService executor;

 public StanfordMultiThreadingTest() throws Exception {
  javax.xml.bind.JAXBContext jaxbCtx = JAXBContext.newInstance(Pair.class,Corpus.class);
  unmarshaller = jaxbCtx.createUnmarshaller();
  executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
 }

 public void readXML(String fileName) throws Exception {
  System.out.println("Reading XML - Started");
  corpus = (Corpus) unmarshaller.unmarshal(new InputStreamReader(new FileInputStream(fileName), StandardCharsets.UTF_8));
  System.out.println("Reading XML - Ended");
 }

 public void parseSentences() throws Exception {
  System.out.println("Parsing - Started");

  // turn pairs into a list of sentences
  List<String> sentences = new ArrayList<String>();
  for (Pair pair : corpus.getPairList()) {
   sentences.add(pair.getT());
   sentences.add(pair.getH());
  }

  // prepare the properties
  final Properties props = new Properties();
  props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");

  // first run is long since models are loaded
  new StanfordCoreNLP(props);

  // to avoid the CoreNLP initialization prints (e.g. "Adding annotation pos")
  final PrintStream nullPrintStream = new PrintStream(new NullStream());
  PrintStream err = System.err;
  System.setErr(nullPrintStream);

  int totalCount = sentences.size();
  AtomicInteger counter = new AtomicInteger(0);

  // use java concurrency to parallelize the parsing
  for (String sentence : sentences) {
   executor.execute(new Runnable() {
    @Override
    public void run() {
     try {
      StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
      Annotation annotation = new Annotation(sentence);
      pipeline.annotate(annotation);
      int done = counter.incrementAndGet();
      if (done % 20 == 0) {
       System.out.println("Done: " + String.format("%.2f", done * 100 / (double) totalCount));
      }
     } catch (Exception e) {
      System.setErr(err);
      e.printStackTrace();
      System.setErr(nullPrintStream);
      executor.shutdownNow();
     }
    }
   });
  }
  executor.shutdown();
  
  System.out.println("Waiting for parsing to end.");  
  executor.awaitTermination(10, TimeUnit.MINUTES);

  System.out.println("Parsing - Ended");
 }

 public static void main(String[] args) throws Exception {
  StanfordMultiThreadingTest smtt = new StanfordMultiThreadingTest();
  smtt.readXML(args[0]);
  smtt.parseSentences();
 }

}

In my attempt to find some background information, I encountered answers given by Christopher Manning and Gabor Angeli from Stanford which indicate that recent versions of Stanford CoreNLP should be thread-safe. However, a recent bug report on CoreNLP version 3.4.1 describes a concurrency problem. As mentioned in the title, I'm using version 3.5.2.

It's unclear to me whether the problem I'm facing is due to a bug or to something wrong in the way I use the package. I'd appreciate it if someone more knowledgeable could shed some light on this. I hope the sample code is useful for reproducing the problem. Thanks!

Assaf

2 Answers


Have you tried using the threads option? You can specify a number of threads for a single StanfordCoreNLP pipeline and then it will process sentences in parallel.

For example, if you want to process sentences on 8 cores, set the threads option to 8:

Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
props.put("threads", "8");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Nevertheless, I think your solution should also work. We'll check whether there is some concurrency bug, but using this option might solve your problem in the meantime.

Sebastian Schuster
  • Thanks for the suggestion. I'd like to try that, but I'm not sure how to use the interface. Assuming that the 'threads' property is set, how should I pass the sentences to be annotated in parallel? Using multiple threads that use the same instance of StanfordCoreNLP? Or by a method other than 'annotate()' which passes several sentences at once? Thanks! – Assaf Jun 07 '15 at 13:55
  • The argument to the constructor of `Annotation` is actually not a sentence but an entire document. Store several (or even all) sentences in the `sentence` variable and separate them with "\n". Also set the option "ssplit.eolonly" to "true" in order to prevent the sentence splitter from splitting an actual sentence by mistake. After parsing, the annotation object contains a list of sentences where each sentence has the parsing, pos, lemma etc. annotations. – Sebastian Schuster Jun 07 '15 at 18:11
  • Thanks, I tried that. However, either there's a problem with the mode of annotating multiple sentences separated by '\n' or I'm doing something wrong. I'm able to get 100 sentences parsed, but not 1000 or 2000. When fed with 1000 or 2000 sentences the call to annotate() runs endlessly. In addition, there's almost no difference in performance between 1, 2 or 4 threads (my hardware has 4) when I'm testing with 100 sentences. It is slightly slower than using a single thread and calling annotate() with one sentence at a time. – Assaf Jun 08 '15 at 20:53
  • I have an updated sample code here: https://dl.dropboxusercontent.com/u/21642925/StanfordMultiThreadingTest.java To run it you can play with 3 parameters: annotationMode - either 'together' (a single call to annotate() with multiple sentences separated by '\n') or 'separated' (multiple calls to annotate() each of which with a single sentence); coresMode - either a single core, half the number of cores, or all cores; maxSentences - the maximal number of sentences to parse. I will really appreciate it if you could try to run this code and let me know if you manage to reproduce these problems. – Assaf Jun 08 '15 at 20:58
  • @Assaf did you find any solution to it on your own? else I think this would be the accepted answer? – TheRajVJain Jan 29 '18 at 10:36
  • 1
    @RajVJain - what answer? – Assaf Jan 30 '18 at 20:53
  • My tests show similar performance (i.e. no real improvement) - this Stack Overflow question seems to shed some light on it by indicating that only a few annotators are actually threadsafe: https://stackoverflow.com/a/51662061/498949. – Chris Rae Jan 31 '19 at 05:00
  • Sorry, should have made clearer that specifying that -threads option just says to run multiple threads for any annotators that are thread-safe. Which it seems is only a handful and they're not ones that normally take a lot of CPU time. – Chris Rae Jan 31 '19 at 19:19
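Putting the suggestions from this answer and its comments together, a minimal sketch of the single-pipeline approach might look as follows (the sentence strings are placeholders; the `threads` and `ssplit.eolonly` properties are the ones discussed above):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class ThreadsOptionSketch {
 public static void main(String[] args) {
  Properties props = new Properties();
  props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
  props.put("threads", "8");            // parallelize within one pipeline
  props.put("ssplit.eolonly", "true");  // split sentences only at newlines

  // One pipeline; models are loaded once.
  StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

  // Join all sentences into one "document", one sentence per line.
  List<String> sentences = Arrays.asList("First sentence here.", "Second sentence here.");
  String document = String.join("\n", sentences);

  Annotation annotation = new Annotation(document);
  pipeline.annotate(annotation);

  // Each element now carries the pos, lemma, parse, etc. annotations.
  List<CoreMap> parsed = annotation.get(CoreAnnotations.SentencesAnnotation.class);
  System.out.println("Parsed " + parsed.size() + " sentences");
 }
}
```

This is only a sketch: it requires the CoreNLP 3.5.2 jars and models on the classpath, and (as the later comments note) the `threads` option only parallelizes annotators that are themselves thread-safe.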

I had the same problem, and using a build from the latest GitHub revision (as of today) solved it. So I think it is a CoreNLP issue that has been fixed since 3.5.2.

See also CoreNLP on Apache Spark
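Assuming the concurrency fix in newer builds, the simpler pattern (a sketch, not taken from either answer) is to share a single pipeline instance across worker threads so that the models are loaded only once:

```java
import java.util.Arrays;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class SharedPipelineSketch {
 public static void main(String[] args) throws InterruptedException {
  Properties props = new Properties();
  props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");

  // One pipeline, loaded once, shared by all workers.
  StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

  ExecutorService executor =
    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
  for (String sentence : Arrays.asList("First sentence.", "Second sentence.")) {
   executor.execute(() -> {
    Annotation annotation = new Annotation(sentence);
    pipeline.annotate(annotation); // safe only if annotate() is actually thread-safe
   });
  }
  executor.shutdown();
  executor.awaitTermination(10, TimeUnit.MINUTES);
 }
}
```

This avoids the per-thread model loading in the original sample code, but it only works on a build where `annotate()` is thread-safe, which is exactly what this answer reports for post-3.5.2 revisions.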

peschü