
I have a class KeywordCount which tokenizes a given sentence and tags it using the Apache OpenNLP maxent POS tagger. I first tokenize the input and then feed the tokens to the tagger. My problem is that RAM usage stays at up to 165 MB after the jar has completed its tasks; the rest of the program just makes a DB call and checks for new tasks. I have isolated the leak to this class. You can safely ignore the Apache POI Excel code. Can anyone find the leak in this code?

public class KeywordCount {
Task task;
String taskFolder = "";
List<String> listOfWords;

public KeywordCount(String taskFolder) {
    this.taskFolder = taskFolder;
    listOfWords = new ArrayList<String>();
}

public void tagText() throws Exception {
    String xlsxOutput = taskFolder + File.separator + "results_pe.xlsx";

    FileInputStream fis = new FileInputStream(new File(xlsxOutput));
    XSSFWorkbook wb = new XSSFWorkbook(fis);
    XSSFSheet sheet = wb.createSheet("Keyword Count");
    XSSFRow row = sheet.createRow(0);
    Cell cell = row.createCell(0);

    XSSFCellStyle csf = (XSSFCellStyle)wb.createCellStyle();
    csf.setVerticalAlignment(CellStyle.VERTICAL_TOP);
    csf.setBorderBottom(CellStyle.BORDER_THICK);
    csf.setBorderRight(CellStyle.BORDER_THICK);
    csf.setBorderTop(CellStyle.BORDER_THICK);
    csf.setBorderLeft(CellStyle.BORDER_THICK);
    Font fontf = wb.createFont();
    fontf.setColor(IndexedColors.GREEN.getIndex());
    fontf.setBoldweight(Font.BOLDWEIGHT_BOLD);
    csf.setFont(fontf);

    int rowNum = 0;
    BufferedReader br = null;
    InputStream modelIn = null;
    POSModel model = null;
    try {
      modelIn = new FileInputStream("taggers" + File.separator + "en-pos-maxent.bin");
      model = new POSModel(modelIn);
    }
    catch (IOException e) {
      // Model loading failed, handle the error
      e.printStackTrace();
    }
    finally {
      if (modelIn != null) {
        try {
          modelIn.close();
        }
        catch (IOException e) {
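          // nothing useful can be done if close() fails, so ignore it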
        }
      }
    }
    File ftmp = new File(taskFolder + File.separator + "phrase_tmp.txt");
    if(ftmp.exists()) {
        br = new BufferedReader(new FileReader(ftmp));
        POSTaggerME tagger = new POSTaggerME(model);
        String line = "";
        while((line = br.readLine()) != null) {
            if (line.equals("")) {
                break;
            }
            row = sheet.createRow(rowNum++);
            if(line.startsWith("Match")) {
                int index = line.indexOf(":");
                line = line.substring(index + 1);
                String[] sent = getTokens(line);
                String[] tags = tagger.tag(sent); 
                for(int i = 0; i < tags.length; i++) {
                    if (tags[i].equals("NN") || tags[i].equals("NNP") || tags[i].equals("NNS") || tags[i].equals("NNPS")) {
                        listOfWords.add(sent[i].toLowerCase());
                    } else if (tags[i].equals("JJ") || tags[i].equals("JJR") || tags[i].equals("JJS")) {
                        listOfWords.add(sent[i].toLowerCase());
                    }
                }

                Map<String, Integer> treeMap = new TreeMap<String, Integer>();
                for(String temp : listOfWords) {
                    Integer counter = treeMap.get(temp);
                    treeMap.put(temp, (counter == null) ? 1 : counter + 1);
                }
                listOfWords.clear();
                sent = null;
                tags = null;
                if (treeMap != null && !treeMap.isEmpty()) {
                    for(Map.Entry<String, Integer> entry : treeMap.entrySet()) {
                        row = sheet.createRow(rowNum++);
                        cell = row.createCell(0);
                        cell.setCellValue(entry.getKey().substring(0, 1).toUpperCase() + entry.getKey().substring(1));
                        XSSFCell cell1 = row.createCell(1);
                        cell1.setCellValue(entry.getValue());
                    }
                    treeMap.clear();
                }
                treeMap = null;
            }
            rowNum++;
        }
        br.close();
        tagger = null;
        model = null;
    }
    sheet.autoSizeColumn(0);
    fis.close();

    FileOutputStream fos = new FileOutputStream(new File(xlsxOutput));
    wb.write(fos);
    fos.close();
    System.out.println("Finished writing XLSX file for Keyword Count!!");
}

public String[] getTokens(String match) throws Exception {
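// note: this re-reads and re-parses en-token.bin from disk on every call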
    InputStream modelIn = new FileInputStream("taggers" + File.separator + "en-token.bin");
    TokenizerModel model = null;
    try {
      model = new TokenizerModel(modelIn);
    }
    catch (IOException e) {
      e.printStackTrace();
    }
    finally {
      if (modelIn != null) {
        try {
          modelIn.close();
        }
        catch (IOException e) {
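          // nothing useful can be done if close() fails, so ignore it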
        }
      }
    }

    Tokenizer tokenizer = new TokenizerME(model);
    String[] tokens = tokenizer.tokenize(match);
    model = null;

    return tokens;
}

}

On my system the GC reclaims the RAM once usage reaches 165 MB, but when I deploy to the server the GC is not performed and usage climbs to 480 MB (49% of the server's RAM).


1 Answer


First of all, increased heap usage is not evidence of a memory leak. It may simply be that the GC has not run yet.
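A quick (if crude) way to check is to compare the used heap before and after requesting a collection. A minimal sketch; note that System.gc() is only a hint, which the JVM is free to ignore:

Runtime rt = Runtime.getRuntime();
long before = rt.totalMemory() - rt.freeMemory();
System.gc();  // a request, not a guarantee
long after = rt.totalMemory() - rt.freeMemory();
// >> 20 converts bytes to MB
System.out.println("Used heap: " + (before >> 20) + " MB -> " + (after >> 20) + " MB");

If the "after" figure drops well below 165 MB, most of that memory was collectable garbage rather than a leak.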

Having said that, it is doubtful that anyone can spot a memory leak just by "eyeballing" your code. The correct way to solve this is for >>you<< to read up on the techniques for finding Java memory leaks, and >>you<< then use the relevant tools (e.g. visualvm, jhat, etc) to search for the problem yourself.
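For example, capture a heap dump and open it in visualvm or Eclipse MAT. You can take one from the command line with jmap -dump:live,format=b,file=heap.hprof <pid>, or trigger one programmatically on a HotSpot JVM. A sketch of the latter (HotSpotDiagnosticMXBean lives in com.sun.management, so this is HotSpot-specific):

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumper {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean mx =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // live = true restricts the dump to reachable objects,
        // which are the ones a genuine leak would be retaining
        mx.dumpHeap("keywordcount.hprof", true);
    }
}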


"I have isolated the leak to this class. You can safely ignore the Apache POI Excel code."

If we ignore the Apache POI code, the only source of a potential memory "leakage" is that the word list (listOfWords) is retained. (Calling clear() nulls out its contents, but the backing array is kept, and that array's size is determined by the largest size the list ever reached. From a memory footprint perspective, it would be better to replace the list with a new empty list, as sketched below.)
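A minimal sketch of that suggestion, in place of the existing clear() call:

// listOfWords.clear() keeps the old backing array alive at its maximum size;
// dropping the list lets the array be collected along with it
listOfWords = new ArrayList<String>();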

However, that is only a "leak" if you keep a reference to the KeywordCount instance. And if you are doing that because you are using the instance, I wouldn't call that a leak at all.
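For instance, if each task created its own short-lived instance, everything the instance retains becomes collectable as soon as the task finishes. A hypothetical driver loop (fetchTaskFoldersFromDb() is an assumed name for your DB call):

// hypothetical task loop: one KeywordCount per task, discarded afterwards
for (String folder : fetchTaskFoldersFromDb()) {
    KeywordCount kc = new KeywordCount(folder);
    kc.tagText();
    // kc is unreachable after each iteration, so its listOfWords can be collected
}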
