
I am using Java 8 with Apache OpenNLP. I have a service that extracts all the nouns from a paragraph. This works as expected on my local server, and it also ran on an OpenShift server with no problems. However, it uses a lot of memory. I now need to deploy the application to an AWS Elastic Beanstalk Tomcat server.

One solution would be to upgrade from the AWS Elastic Beanstalk t1.micro to a larger instance type, but I am on a small budget and want to avoid the extra fees if possible.

When the app runs and tries to do the word chunking, it fails with the following error:

dispatch failed; nested exception is java.lang.OutOfMemoryError: Java heap space] with root cause
 java.lang.OutOfMemoryError: Java heap space
  at opennlp.tools.ml.model.AbstractModelReader.getParameters(AbstractModelReader.java:148)
  at opennlp.tools.ml.maxent.io.GISModelReader.constructModel(GISModelReader.java:75)
  at opennlp.tools.ml.model.GenericModelReader.constructModel(GenericModelReader.java:59)
  at opennlp.tools.ml.model.AbstractModelReader.getModel(AbstractModelReader.java:87)
  at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:35)
  at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:31)
  at opennlp.tools.util.model.BaseModel.finishLoadingArtifacts(BaseModel.java:328)
  at opennlp.tools.util.model.BaseModel.loadModel(BaseModel.java:256)
  at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:179)
  at opennlp.tools.parser.ParserModel.<init>(ParserModel.java:180)
  at com.jobs.spring.service.lang.LanguageChunkerServiceImpl.init(LanguageChunkerServiceImpl.java:35)
  at com.jobs.spring.service.lang.LanguageChunkerServiceImpl.getNouns(LanguageChunkerServiceImpl.java:46)

Question

Is there a way to either:

  1. Reduce the amount of memory used when extracting the nouns from a paragraph?

  2. Use a different API (other than Apache OpenNLP) that won't use as much memory?

  3. Configure the AWS Elastic Beanstalk Tomcat server to cope with the demands?
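
A note on option 3: on the Elastic Beanstalk Tomcat platform the JVM heap can be set through an `.ebextensions` config file. A sketch, assuming the platform's standard `aws:elasticbeanstalk:container:tomcat:jvmoptions` namespace (the file name is arbitrary, and the t1.micro's ~600 MB of RAM still caps how far `-Xmx` can usefully go):

```yaml
# .ebextensions/jvm.config  (hypothetical file name)
option_settings:
  aws:elasticbeanstalk:container:tomcat:jvmoptions:
    Xms: 256m   # initial heap
    Xmx: 448m   # max heap; leave headroom for the OS on a ~600 MB instance
```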

Code Sample:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;

import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;
import opennlp.tools.util.InvalidFormatException;

@Component("languageChunkerService")
@Transactional
public class LanguageChunkerServiceImpl implements LanguageChunkerService {

    private Set<String> nouns = null;
    private InputStream modelInParse = null;
    private ParserModel model = null;
    private Parser parser = null;

    public void init() throws InvalidFormatException, IOException {
        // load the chunking model straight from the classpath;
        // getResource(...).getFile() breaks inside a packaged WAR
        modelInParse = getClass().getClassLoader().getResourceAsStream("en-parser-chunking.bin");
        model = new ParserModel(modelInParse); // line 35
        // create parser from the model
        parser = ParserFactory.create(model);
    }

    @Override
    public Set<String> getNouns(String sentenceToExtract) {
        Set<String> extractedNouns = new HashSet<String>();
        nouns = new HashSet<>();
        try {
            if (parser == null) {
                init();
            }

        Parse[] topParses = ParserTool.parseLine(sentenceToExtract, parser, 1);

            // call subroutine to extract noun phrases
            for (Parse p : topParses) {
                getNounPhrases(p);
            }

            // clean up each extracted noun and collect it
            for (String s : nouns) {
                String word = s.replaceAll("[^a-zA-Z ]", "").toLowerCase();
                extractedNouns.add(word);
            }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (modelInParse != null) {
                try {
                    modelInParse.close();
                } catch (IOException e) {
                    // nothing useful to do if close fails
                }
            }
        }
        return extractedNouns;
    }

    // recursively walk the parse tree, collecting noun tokens
    private void getNounPhrases(Parse p) {
        if (p.getType().equals("NN")) { // "NN" = singular common noun tag (not "NP", noun phrase)
            nouns.add(p.getCoveredText());
        }
        for (Parse child : p.getChildren())
            getNounPhrases(child);
    }
}
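
One memory-related wrinkle in the code above: `init()` is called lazily from `getNouns()` with no synchronization, so two concurrent requests can both see `parser == null` and each load the large model, doubling peak heap at the worst moment. A minimal load-once holder, sketched in plain Java (no OpenNLP types; the supplier stands in for the `new ParserModel(...)` call):

```java
import java.util.function.Supplier;

// Caches the result of an expensive load so it runs exactly once,
// even when several request threads arrive at the same time.
final class Lazy<T> {
    private final Supplier<T> loader;
    private volatile T value;

    Lazy(Supplier<T> loader) { this.loader = loader; }

    T get() {
        T v = value;                  // first read without locking
        if (v == null) {
            synchronized (this) {     // only contended on first use
                v = value;
                if (v == null) {
                    v = loader.get(); // e.g. new ParserModel(modelInParse)
                    value = v;
                }
            }
        }
        return v;
    }
}
```

In the service, `parser` could then become a `Lazy<Parser>` field, so the model is resident exactly once no matter how many request threads race on first use.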

UPDATE

Tomcat 8 config:

(screenshot of the Tomcat 8 JVM memory settings)

Richard
  • Run a profiler on your computer to see what kind of memory amounts the program needs during normal runs, then you can determine how much you need to up your budget. Even if you could whittle down some minor chunks, you'd still be running at the edge of memory, and that would make it very unstable. – Kayaman Mar 28 '17 at 06:49
  • Good idea. Will do. Also, do you think there's a way to reduce the `en-parser-chunking.bin`? i.e. It may be loading a number of features I may not require. I know OpenNLP does a number of different language parsing, and I only need to extract nouns. – Richard Mar 28 '17 at 06:52
  • I'm not familiar with OpenNLP, but since it's probably using a major part of your memory, you might want to read the documentation very carefully. – Kayaman Mar 28 '17 at 06:54
  • Perhaps I should use a different toolkit. Does anyone have experience with any of these, and can recommend one: https://en.wikipedia.org/wiki/Outline_of_natural_language_processing#Natural_language_processing_toolkits – Richard Mar 28 '17 at 07:13
  • By the way, how much memory do you have (1GB?)? How much heap are you giving Tomcat, and do you have anything else running on the system? – Kayaman Mar 29 '17 at 06:43
  • I am running a t1.micro server instance (600mb ram). http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html. All I have running is a Tomcat8 server. – Richard Mar 29 '17 at 13:22
  • You missed the important question. How much heap are you giving Tomcat? – Kayaman Mar 29 '17 at 13:22
  • Sorry, I have added an UPDATE above with the memory sizes. – Richard Mar 29 '17 at 13:30
  • Well, you could tweak the memory a bit higher, it would at least be more effective than trying to micro-optimize your code, but like I said, on your machine you can see how much memory it's going to need and based on that get an instance with more memory. – Kayaman Mar 29 '17 at 13:37
  • Thank you. I haven't had a chance to profile the code yet, but will do so soon. – Richard Mar 29 '17 at 13:39
  • How much more expensive is the T2.micro ? Amazon specifies it replaces the t1.micro which has been phased out. I could not find the pricing for a t1 family. The T2 has 1G of ram perhaps that would help though it all depends on your dataset. profiling here is the real key to get to the bottom of it. – Newtopian Mar 29 '17 at 13:54
  • T1.micro is free for a year. I think t2.micro is about 12 USD per month. I do need to profile my code. – Richard Mar 29 '17 at 13:56
  • Which version of OpenNLP do you use exactly? Might be more important to know, than the costs for other AWS instances. – MWiesner May 12 '17 at 19:16

1 Answer

-1

First of all, you should try to optimize your code. Start by precompiling the regex with `Pattern.compile` and reusing it, rather than calling `replaceAll`, since `replaceAll` recompiles the pattern on every call. (https://eyalsch.wordpress.com/2009/05/21/regex/)
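
A sketch of that suggestion, using the cleanup rule from the question's `getNouns()` (strip non-letters, lowercase); the class and method names here are made up for illustration:

```java
import java.util.regex.Pattern;

class NounCleaner {
    // Compiled once and reused; String.replaceAll() recompiles the
    // pattern on every single call.
    private static final Pattern NON_LETTERS = Pattern.compile("[^a-zA-Z ]");

    static String clean(String word) {
        return NON_LETTERS.matcher(word).replaceAll("").toLowerCase();
    }
}
```

For example, `NounCleaner.clean("Java8!")` returns `"java"`.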

Second, you should not store the parsed sentences in an array. Third, try allocating the memory for your array with a `ByteBuffer`. Another hint, which may affect you the most: use a `BufferedReader` to read your chunked file. (out of memory error, java heap space)

After this you should already see lower memory usage. If those tips didn't help, please provide a memory dump/allocation graph.

One more tip: a `HashSet` takes roughly 5.5x more memory than an unordered list. (Performance and Memory allocation comparison between List and Set)
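
Whatever the exact ratio, the shape of that trade can be sketched: gather results into an `ArrayList` during the recursive tree walk (lower per-element overhead, duplicates tolerated) and build the `HashSet` once at the end. The names here are illustrative, not from the question:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NounBuffer {
    private final List<String> nouns = new ArrayList<>(); // cheap list cells while collecting

    void add(String noun) {   // called from the recursive parse-tree walk
        nouns.add(noun);
    }

    Set<String> toSet() {     // pay the hash-table overhead once, at the end
        return new HashSet<>(nouns);
    }
}
```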

Emanuel
  • I can't see *anything* right with this answer. Using `Pattern` instead of `replaceAll` may affect the speed, but not so much the memory. Using `ByteBuffer` instead of array, no sense there either, unless he's working with code that expects buffers. `BufferedReader` has nothing to do with anything here, as it also affects speed rather than memory use. Finally the one thing that *does* affect memory (`HashSet`), could also affect speed in a dramatical way. – Kayaman Mar 28 '17 at 06:24
  • "Not so much"? If he is low on budget and runs out of memory, he has to optimize everything, even if it only saves a few KB of memory. – Emanuel Mar 28 '17 at 06:25
  • Hi Emanuel, thank you for those tips, I will implement them to try get it more efficient. From the stack trace, you can see that `line 35` is where the code falls over with an `OutOfMemoryError`, so it's in the initialization of the `OpenNLP` api. So I agree, that your above suggestions will improve the code, I don't think it will solve the `OutOfMemoryError`. (p.s. I'm not sure why someone gave you a down vote, I find your suggestions useful.) – Richard Mar 28 '17 at 06:26
  • @Richard You just think they're useful. As I explained in my comment, they're pretty much guesswork and about as useful as "use `short` instead of `int` to save 2 bytes per variable". Since you're running it at localhost, you can use a profiler to examine the memory use instead of trying things that have very little chance of helping you. – Kayaman Mar 28 '17 at 06:28
  • Np. Keep in mind that "Out of memory" may be caused by memory usage somewhere other than OpenNLP. It may be that there simply isn't enough memory left for OpenNLP because it is used and not released elsewhere. Kayaman, using an unordered list instead of a set consumes 5.5 times less memory(!), which is not the same as comparing a short to an integer. – Emanuel Mar 28 '17 at 06:28
  • It's still micro-optimization, and no, it's not suitable in this situation. When you're talking about embedded environments, then you can start planning on saving bytes. If you're trying to do NLP or some other things that require resources, you get those resources. We're talking about a `t1.micro` instance here, which is the tiniest thing you can get. If you can't afford to put some more money in it, then you should really be looking into getting a job instead of trying to optimize memory. – Kayaman Mar 28 '17 at 10:06