
I'm trying to extract information from several files using the OpenIE tool from Stanford CoreNLP. It gives an out-of-memory error when several files are passed as input, but not when just one is:

All files have been queued; awaiting termination...
java.lang.OutOfMemoryError: GC overhead limit exceeded
at edu.stanford.nlp.graph.DirectedMultiGraph.outgoingEdgeIterator(DirectedMultiGraph.java:508)
at edu.stanford.nlp.semgraph.SemanticGraph.outgoingEdgeIterator(SemanticGraph.java:165)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$GOVERNER$1.advance(GraphRelation.java:267)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.initialize(GraphRelation.java:1102)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.<init>(GraphRelation.java:1083)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$GOVERNER$1.<init>(GraphRelation.java:257)
at edu.stanford.nlp.semgraph.semgrex.GraphRelation$GOVERNER.searchNodeIterator(GraphRelation.java:257)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChildIter(NodePattern.java:320)
at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.matches(CoordinationPattern.java:211)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.matchChild(NodePattern.java:514)
at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.matches(NodePattern.java:542)
at edu.stanford.nlp.naturalli.RelationTripleSegmenter.segmentVerb(RelationTripleSegmenter.java:541)
at edu.stanford.nlp.naturalli.RelationTripleSegmenter.segment(RelationTripleSegmenter.java:850)
at edu.stanford.nlp.naturalli.OpenIE.relationInFragment(OpenIE.java:354)
at edu.stanford.nlp.naturalli.OpenIE.lambda$relationsInFragments$2(OpenIE.java:366)
at edu.stanford.nlp.naturalli.OpenIE$$Lambda$76/1438896944.apply(Unknown Source)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1540)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at edu.stanford.nlp.naturalli.OpenIE.relationsInFragments(OpenIE.java:366)
at edu.stanford.nlp.naturalli.OpenIE.annotateSentence(OpenIE.java:486)
at edu.stanford.nlp.naturalli.OpenIE.lambda$annotate$3(OpenIE.java:554)
at edu.stanford.nlp.naturalli.OpenIE$$Lambda$25/606198361.accept(Unknown Source)
at java.util.ArrayList.forEach(ArrayList.java:1249)
at edu.stanford.nlp.naturalli.OpenIE.annotate(OpenIE.java:554)
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:71)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:499)
at edu.stanford.nlp.naturalli.OpenIE.processDocument(OpenIE.java:630)
DONE processing files. 1 exceptions encountered.

I pass the files as input using this call:

java -mx3g -cp stanford-corenlp-3.6.0.jar:stanford-corenlp-3.6.0-models.jar:CoreNLP-to-HTML.xsl:slf4j-api.jar:slf4j-simple.jar edu.stanford.nlp.naturalli.OpenIE file1 file2 file3 etc.

I tried increasing the memory with -mx3g and other variants, and although the number of processed files increases, it isn't by much (from 5 to 7, for example). Each file is processed correctly on its own, so I've ruled out a single file with overly long sentences or too many lines as the cause.

Is there an option I'm not considering, some OpenIE or Java flag, that I can use to force a dump to the output, a cleanup, or a garbage-collection pass between each processed file?

Thank you in advance

smothP
  • Code to invoke, please. – Woot4Moo Apr 05 '16 at 18:03
  • How large are the files you're processing (e.g., in words)? How many threads does your computer have? One thing you can try is to set `-threads 1` to disable parallelism in processing the documents. This could solve the problem if it's loading many large documents at once. – Gabor Angeli Apr 05 '16 at 22:48
  • @Woot4Moo I call OpenIE directly from the shell, using the java call I put above, without changing the provided source code, but thanks anyway. – smothP Apr 06 '16 at 02:19
  • @GaborAngeli It worked with the `-threads 1` flag! Thank you!! If you want, post it as an answer so I can mark the question as solved :) For disclosure, the files are around 15 KB each, with 2000-4000 words (10-15 per line), I think. – smothP Apr 06 '16 at 02:25
  • @GaborAngeli Unrelated question: do you know if it would be possible to write some separator to the output file (using the shell, etc.) that divides each processed file's results? OpenIE dumps everything together into the output file provided. Thank you – smothP Apr 06 '16 at 02:29
  • @smothP Excellent! Chances are, increasing the memory by a few GB should get it to work multithreaded as well. The CoreNLP Annotation objects are quite big, and OpenIE probably produces more intermediate garbage than it should, especially for long sentences. Re: different outputs, that's a good idea for a new feature. For now, you can set the output format to `-format reverb`, and the first column will then contain the input filename, which you can use to route the output (see the sketch after these comments). – Gabor Angeli Apr 06 '16 at 02:33
  • (See http://reverb.cs.washington.edu/README.html for the ReVerb output format.) – Gabor Angeli Apr 06 '16 at 02:36
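To illustrate the routing suggestion from the comments above: assuming the combined ReVerb-format output has been saved to a file (hypothetically named out.reverb here), and given that its first tab-separated column holds the input filename, a short awk command can split it back into one output per input. This is a sketch, not something from the original thread:

# hypothetical: split combined ReVerb output by input filename (column 1)
awk -F'\t' '{ print > ($1 ".triples") }' out.reverb

Each line is appended to a file named after its first column, so triples from different inputs end up in separate .triples files.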

2 Answers


Run this command to get a separate annotation per file (sample-file-list.txt should list one input file per line):

java -Xmx4g -cp "stanford-corenlp-full-2015-12-09/*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie -filelist sample-file-list.txt -outputDirectory output_dir -outputFormat text
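A quick way to generate sample-file-list.txt from a directory of inputs (the directory name and extension here are just placeholders):

# hypothetical: list every .txt file under input_dir, one per line
find input_dir -name '*.txt' > sample-file-list.txt

With -outputDirectory set, CoreNLP then writes one output file per input into output_dir, keeping each file's annotations separate.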
StanfordNLPHelp
  • Note: I just fixed this command, since in the original I was using a properties file on my local machine! – StanfordNLPHelp Apr 06 '16 at 06:33
  • Also, there are a variety of output formats (json, xml). I just like to use text for readability, but it's probably a poor choice for passing on to the next step in a pipeline. – StanfordNLPHelp Apr 06 '16 at 06:34
  • Note that this will dump a lot of extra stuff alongside OpenIE; i.e., all the other CoreNLP annotations. – Gabor Angeli Apr 06 '16 at 06:41
  • Thank you both. That worked, but I'll use @GaborAngeli's suggestion of outputting in the ReVerb format, as I was already using ReVerb for other things. – smothP Apr 06 '16 at 11:46

From the comments above: I suspect this is an issue with too much parallelism and too little memory. OpenIE is a bit memory-hungry, especially with long sentences, so running many files in parallel can take up a fair bit of memory.

An easy fix is to force the program to run single-threaded by setting the `-threads 1` flag. If possible, increasing the memory should help as well.
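For example, applied to the invocation from the question (heap size and classpath carried over unchanged; a sketch rather than the only valid form):

java -mx3g -cp stanford-corenlp-3.6.0.jar:stanford-corenlp-3.6.0-models.jar:CoreNLP-to-HTML.xsl:slf4j-api.jar:slf4j-simple.jar edu.stanford.nlp.naturalli.OpenIE -threads 1 file1 file2 file3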

Gabor Angeli
  • Thank you again! My machine only has 4 GB, so I only tried up to 3 GB of heap. I'll try to get access to a machine with more memory just to test it, but this solution is perfect. – smothP Apr 06 '16 at 11:44