I am quite new to NLP and am using GATE for it. I get an OutOfMemoryError when I run my code on a large data set (7K+ records). Below is the code where the exception occurs.

    /**
     * Run ANNIE
     *
     * @param controller
     * @throws GateException
     */
    public void execute(SerialAnalyserController controller)
            throws GateException {
        TestLogger.info("Running ANNIE...");
        controller.execute();     /**** GateProcessor.java:217 ***/

        // controller.cleanup();
        TestLogger.info("...ANNIE complete");
    }

Here is the log:

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.HashMap.addEntry(Unknown Source)
        at java.util.HashMap.put(Unknown Source)
        at java.util.HashMap.putAll(Unknown Source)
        at gate.annotation.AnnotationSetImpl.<init>(AnnotationSetImpl.java:111)
        at gate.jape.SinglePhaseTransducer.attemptAdvance(SinglePhaseTransducer.java:448)
        at gate.jape.SinglePhaseTransducer.transduce(SinglePhaseTransducer.java:287)
        at gate.jape.MultiPhaseTransducer.transduce(MultiPhaseTransducer.java:168)
        at gate.jape.Batch.transduce(Batch.java:352)
        at gate.creole.Transducer.execute(Transducer.java:116)
        at gate.creole.SerialController.runComponent(SerialController.java:177)
        at gate.creole.SerialController.executeImpl(SerialController.java:136)
        at gate.creole.SerialAnalyserController.executeImpl(SerialAnalyserController.java:67)
        at gate.creole.AbstractController.execute(AbstractController.java:42)
        at in.co.test.GateProcessor.execute(GateProcessor.java:217)

I would like to know what exactly happens inside the execute function and how this can be resolved. Thanks.

Divya Motiwala
  • How are you starting GATE, exactly? What is the exact command? I think you'll do well to Google "Exception in thread "main" java.lang.OutOfMemoryError: Java heap space". This is most likely a general Java issue, not a GATE issue. – dmn Feb 26 '13 at 16:34
  • @dmn: I couldn't figure out what exactly you meant by starting GATE. I am not using it standalone; I have embedded the GATE JAR in my code and use some of its functionality. I did google it, and most people suggest **-Xmx** to increase the heap size. But is there any other way? – Divya Motiwala Feb 27 '13 at 05:17
  • Yes, that's right. What happens when you increase the heap size using -Xmx? – dmn Feb 28 '13 at 19:30
  • My 7K records total ~2.2 MB. After increasing the heap size with -Xmx, it runs for input up to ~1.87 MB and then crashes again. It also becomes very slow in the process. I have increased the heap size to 512 MB. – Divya Motiwala Mar 01 '13 at 06:35

1 Answer

Processing large (or many) documents in GATE can require a lot of memory, because GATE needs space to store annotations. Various processing resources also require a lot of memory: gazetteers, statistical model-based taggers, etc.
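
Because an embedded pipeline runs with whatever heap the host JVM was started with, it can help to log the actual limit before calling execute(). The following is a minimal sketch using only java.lang.Runtime; the class name HeapReport and its method names are made up for illustration and are not part of GATE.

```java
public class HeapReport {

    private static final long MIB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use (governed by -Xmx), in MiB. */
    public static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    /** Heap currently occupied by objects (allocated minus free), in MiB. */
    public static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        // Print both figures so a run can be checked against the intended -Xmx value.
        System.out.println("max heap (MiB):  " + maxHeapMiB());
        System.out.println("used heap (MiB): " + usedHeapMiB());
    }
}
```

Logging usedHeapMiB() after each processed document is one way to see whether memory keeps growing across documents, which would suggest that resources are not being released.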

A trick in the GATE Developer GUI is to store the corpus of documents in a datastore, then load only the corpus and run the pipeline. GATE is smart enough to load one document at a time, process it, then save and close it before opening the next one. (You can first store an empty corpus in a datastore and then "populate" it from a folder; this again loads documents one by one without wasting memory.)

This is exactly what you should do in your code: open a document, process it, save it, and close it before opening the next one. If you have a single large document, you should split it (in a way that doesn't break your annotation performance).

Here is a code example from the "Advanced GATE Embedded" module:

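    // for each piece of text:
    Document doc = (Document)Factory.createResource("gate.corpora.DocumentImpl",
                  Utils.featureMap("stringContent", text, "mimeType", mime));
    Corpus corpus = Factory.newCorpus("webapp corpus");
    try {
      corpus.add(doc);
      // hand the one-document corpus to the pipeline
      // (assumes application is a CorpusController)
      application.setCorpus(corpus);
      application.execute();
      ...
    } finally {
      // release the document so its annotations can be garbage collected
      corpus.clear();
      Factory.deleteResource(doc);
    }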
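
For the single-large-document case mentioned above, one possible way to split the text is at blank lines, so that no paragraph is cut in half unless it is longer than the limit by itself. This is only a sketch in plain Java: TextChunker, split and maxChars are hypothetical names, and it assumes records are separated by blank lines, which may not match your data.

```java
import java.util.ArrayList;
import java.util.List;

public class TextChunker {

    /**
     * Splits text into chunks of at most maxChars characters, cutting at
     * blank lines where possible. Assumption: paragraphs are separated by
     * one or more blank lines.
     */
    public static List<String> split(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String paragraph : text.split("\\R\\s*\\R")) {
            if (paragraph.isEmpty()) {
                continue;
            }
            // Flush the current chunk if this paragraph would not fit into it.
            if (current.length() > 0
                    && current.length() + 2 + paragraph.length() > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            // Last resort: a paragraph longer than the limit is cut at fixed offsets.
            while (paragraph.length() > maxChars) {
                chunks.add(paragraph.substring(0, maxChars));
                paragraph = paragraph.substring(maxChars);
            }
            if (current.length() > 0) {
                current.append("\n\n");
            }
            current.append(paragraph);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "first record\n\nsecond record\n\nthird record";
        for (String chunk : split(text, 30)) {
            System.out.println("[" + chunk.length() + " chars] " + chunk.replace("\n", " "));
        }
    }
}
```

Each chunk can then be passed as the text in the loop above. Note that annotation offsets restart at zero in every chunk, so they need to be shifted if positions in the original document matter.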
Yasen