
I am trying to read a large text file (7 GB) line by line, extract all the n-grams from each line, and store them in a HashMap. I created the following class, adapted from this code with some modifications:

public class NgExtract {

    // Returns the distinct n-grams of the given order found in one sentence.
    public List<String> ReturnNgrams(int order, String sent) {
        List<String> ngs = new ArrayList<>();
        String[] unigrams = sent.split(" ");
        String ng;
        for (int i = 0; i < unigrams.length - order + 1; i++) {
            ng = BuildNg(unigrams, i, i + order);
            if (!ngs.contains(ng.trim())) {
                ngs.add(ng.trim());
            }
        }
        return ngs;
    }

    // Joins unigrams[f] .. unigrams[l - 1] into one space-separated n-gram.
    public String BuildNg(String[] unigrams, int f, int l) {
        StringBuilder ngStr = new StringBuilder();
        for (int i = f; i < l; i++) {
            ngStr.append(i > f ? " " : "").append(unigrams[i]);
        }
        return ngStr.toString();
    }
}
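
For illustration, calling ReturnNgrams on a made-up sentence yields the distinct n-grams of the requested order (expected output shown as a comment; the input here is just an example):

    NgExtract extractor = new NgExtract();
    List<String> bigrams = extractor.ReturnNgrams(2, "the cat sat on the mat");
    // bigrams: [the cat, cat sat, sat on, on the, the mat]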

I read the text file in a while loop (see below). If I do any of the following inside the loop, memory consumption keeps going up until there is no memory left on my machine (16 GB of RAM). For smaller files (2-3 GB) the program terminates, but the amount of memory consumed is still very large (7-8 GB), so I am guessing there is a memory leak somewhere that I can't find. On the other hand, when I only call something like sentence.split(" ") inside the loop, the program finishes with no memory problem, so I am almost certain the problem is with NgExtract.

  1. Create an instance of NgExtract inside the loop and call its ReturnNgrams() method for each sentence.

  2. Create an instance of NgExtract outside the loop and call its ReturnNgrams() method inside the loop for each sentence.

  3. Define ReturnNgrams as a static method of NgExtract and call it for each sentence inside the loop.

    BufferedReader corpus = new BufferedReader(
            new InputStreamReader(
                    new FileInputStream("path_2_corpus"), "UTF8"));

    HashMap<String, Integer> allNgs = new HashMap<>();
    String sentence;

    while ((sentence = corpus.readLine()) != null) {

        List<String> ngrams = // ReturnNgrams(sentence), approach 1, 2 or 3 --
                              // the call that I think leads to a massive memory leak

        for (String ng : ngrams) {
            if (!allNgs.containsKey(ng)) {
                allNgs.put(ng, 1);
            } else {
                int tmp = allNgs.get(ng);
                tmp++;
                allNgs.put(ng, tmp);
            }
        }
    }
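
As an aside, the whole containsKey/put counting block can be collapsed into a single Map.merge call (Java 8+); a minimal equivalent sketch:

    for (String ng : ngrams) {
        // inserts 1 for a new n-gram, otherwise adds 1 to the existing count
        allNgs.merge(ng, 1, Integer::sum);
    }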
    
  • What does the hashmap `allNgs` contain (number of elements and typical size), for files of 2-3 Gbytes? I suspect that the HashMap itself consumes your memory... – Serge Ballesta Jun 09 '16 at 10:23
  • When the loop terminates it contains every observed n-gram (in the file) and its count. But this is the same for the 7GB file; the 7GB file just has many more lines. – MAZDAK Jun 09 '16 at 10:27
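
To check the suggestion that the map itself accounts for the growth, one could log the used heap and map size every so often inside the loop. A minimal sketch using Runtime (the reporting interval of one million lines is arbitrary):

    long lines = 0;
    while ((sentence = corpus.readLine()) != null) {
        // ... extract and count n-grams as above ...
        if (++lines % 1_000_000 == 0) {
            Runtime rt = Runtime.getRuntime();
            long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
            System.out.println(lines + " lines, ~" + usedMb
                    + " MB used, " + allNgs.size() + " distinct n-grams");
        }
    }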
