I am trying to read a large text file (7 GB) line by line, extract all the n-grams from each line, and store them in a HashMap. I created the following class, based on this code with some modifications:
import java.util.ArrayList;
import java.util.List;

public class NgExtract {

    // Returns the distinct n-grams of the given order found in one sentence.
    public List<String> ReturnNgrams(int order, String sent) {
        List<String> ngs = new ArrayList<>();
        String[] unigrams = sent.split(" ");
        String ng;
        for (int i = 0; i < unigrams.length - order + 1; i++) {
            ng = BuildNg(unigrams, i, i + order);
            if (!ngs.contains(ng.trim())) {
                ngs.add(ng.trim());
            }
        }
        return ngs;
    }

    // Joins unigrams[f] up to (but not including) unigrams[l] with single spaces.
    public String BuildNg(String[] unigrams, int f, int l) {
        StringBuilder ngStr = new StringBuilder();
        for (int i = f; i < l; i++) {
            ngStr.append(i > f ? " " : "").append(unigrams[i]);
        }
        return ngStr.toString();
    }
}
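To illustrate what ReturnNgrams produces, here is a minimal usage sketch. It embeds a condensed, static copy of the extraction logic so that it stands alone; the class name NgramUsageSketch, the lower-case method names, the sample sentences and the order of 2 are all made up for illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

// Condensed static copy of the extraction logic, so the sketch is self-contained.
class NgramUsageSketch {

    // Returns the distinct n-grams of the given order, in order of first appearance.
    static List<String> returnNgrams(int order, String sent) {
        List<String> ngs = new ArrayList<>();
        String[] unigrams = sent.split(" ");
        for (int i = 0; i < unigrams.length - order + 1; i++) {
            String ng = buildNg(unigrams, i, i + order).trim();
            if (!ngs.contains(ng)) {
                ngs.add(ng);
            }
        }
        return ngs;
    }

    // Joins unigrams[f] up to (but not including) unigrams[l] with single spaces.
    static String buildNg(String[] unigrams, int f, int l) {
        StringBuilder ngStr = new StringBuilder();
        for (int i = f; i < l; i++) {
            ngStr.append(i > f ? " " : "").append(unigrams[i]);
        }
        return ngStr.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample sentence; "a b" occurs twice but is returned once.
        System.out.println(returnNgrams(2, "a b c a b")); // prints [a b, b c, c a]
        // A sentence shorter than the order yields no n-grams.
        System.out.println(returnNgrams(3, "a b"));       // prints []
    }
}
```

Since duplicates within a line are dropped, each n-gram is reported at most once per line.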
I read the text file in a while loop (see below). If I use any of the three approaches listed below inside the loop, memory consumption keeps rising until no memory is left on my machine (16 GB of RAM). For smaller files (2-3 GB) the program terminates, but it still consumes a very large amount of memory (7-8 GB). I am therefore guessing there is a memory leak somewhere that I can't find. Additionally, when I only call something like sentence.split(" ") inside the loop, the program finishes with no memory problem, so I am almost certain the problem is in NgExtract.
1. Create an instance of NgExtract inside the loop and call its ReturnNgrams() method for each sentence.
2. Create an instance of NgExtract outside the loop and call its ReturnNgrams() method inside the loop for each sentence.
3. Define ReturnNgrams as a static method of NgExtract and call it for each sentence inside the loop.
BufferedReader corpus = new BufferedReader(
        new InputStreamReader(new FileInputStream("path_2_corpus"), "UTF8"));
HashMap<String, Integer> allNgs = new HashMap<>();
String sentence;
while ((sentence = corpus.readLine()) != null) {
    List<String> ngrams = // ReturnNgrams(sentence), approach 1, 2 or 3,
                          // which I think leads to a (massive) memory leak
    for (String ng : ngrams) {
        if (!allNgs.containsKey(ng)) {
            allNgs.put(ng, 1);
        } else if (allNgs.containsKey(ng)) {
            int tmp = allNgs.get(ng);
            tmp++;
            allNgs.put(ng, tmp);
        }
    }
}
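For reference, the loop and the three calling variants can be put together in a self-contained sketch. It assumes an n-gram order of 2, assumes that every returned n-gram is counted in a for loop, and replaces the 7 GB file with a tiny in-memory sample read through a StringReader. The names LoopSketch, Extractor, ngramsStatic and count are hypothetical stand-ins, not part of the original code. The sketch only shows the structure of the loop; it is far too small to reproduce the memory growth.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LoopSketch {

    // Hypothetical stand-in for NgExtract, with an instance method (approaches 1 and 2).
    static class Extractor {
        List<String> returnNgrams(int order, String sent) {
            return ngramsStatic(order, sent);
        }
    }

    // Static variant (approach 3): distinct n-grams of one line, in order of appearance.
    static List<String> ngramsStatic(int order, String sent) {
        List<String> ngs = new ArrayList<>();
        String[] unigrams = sent.split(" ");
        for (int i = 0; i < unigrams.length - order + 1; i++) {
            String ng = String.join(" ", Arrays.copyOfRange(unigrams, i, i + order)).trim();
            if (!ngs.contains(ng)) {
                ngs.add(ng);
            }
        }
        return ngs;
    }

    // Reads the source line by line and counts n-grams; approach selects variant 1, 2 or 3.
    static Map<String, Integer> count(Reader source, int order, int approach) {
        Map<String, Integer> allNgs = new HashMap<>();
        Extractor shared = new Extractor(); // only used by approach 2
        try (BufferedReader corpus = new BufferedReader(source)) {
            String sentence;
            while ((sentence = corpus.readLine()) != null) {
                List<String> ngrams;
                if (approach == 1) {
                    ngrams = new Extractor().returnNgrams(order, sentence); // new instance per line
                } else if (approach == 2) {
                    ngrams = shared.returnNgrams(order, sentence);          // one instance, reused
                } else {
                    ngrams = ngramsStatic(order, sentence);                 // static call
                }
                for (String ng : ngrams) {
                    if (!allNgs.containsKey(ng)) {
                        allNgs.put(ng, 1);
                    } else {
                        allNgs.put(ng, allNgs.get(ng) + 1);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return allNgs;
    }

    public static void main(String[] args) {
        String sample = "a b c\na b d\n"; // hypothetical two-line corpus
        for (int approach = 1; approach <= 3; approach++) {
            System.out.println("approach " + approach + ": "
                    + count(new StringReader(sample), 2, approach));
        }
    }
}
```

All three variants build the same map here, and the map keeps one entry per distinct n-gram seen across all lines.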