1

I have a large word document (over 10,000 lines) containing a table of information which must be converted to excel using Java. I am using apache poi to extract the table and to save it to excel. I have the following code and it functions on a subset of rows on an iMac. However, I get a Heap Space exception when running the code on the full document:

public class WordExtractor {
  public static void main(String[] args) {
    try {
      File inputFile = new File("table.docx");
      POITextExtractor extractor = ExtractorFactory.createExtractor(inputFile);

      String text = extractor.getText();
      BufferedReader reader = new BufferedReader(new StringReader(text));
      String line = null;
      boolean breakRead = false;
      int rowCount = 0;
      HSSFWorkbook workbook = new HSSFWorkbook();
      HSSFSheet sheet = workbook.createSheet("sheet1");
      while (!breakRead) {
        line = reader.readLine();
        if (line != null) {
          Row row = sheet.createRow(rowCount);
          StringTokenizer st = new StringTokenizer(line, "\t");
          int cellnum = 0;
          while (st.hasMoreTokens()) {
            Cell cell = row.createCell(cellnum++);
            String token = st.nextToken();
            System.out.println(" = " + token);
            cell.setCellValue(token);
          }
        } else {
          breakRead = true;
        }
        rowCount++;
      }

       try {
         FileOutputStream out =
         new FileOutputStream(new File("new.xls"));
         workbook.write(out);
         out.close();
       } catch (FileNotFoundException e) {
       e.printStackTrace();
       } catch (IOException e) {
       e.printStackTrace();
       }
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}
Deduplicator
  • 44,692
  • 7
  • 66
  • 118
RonanOD
  • 876
  • 1
  • 9
  • 19
  • 1
    check your heap settings in VM. try to increase it. Is this line `System.out.println(" = " + token);` required? Too many string objects get created this way. don't use + operator for string concatenation use `StringBuilder` instead and clear its contents after printing for each iteration. – Nishant Oct 26 '13 at 14:48
  • 1
    Have you tried simply increasing the size of your Java Heap Space? (The default is almost embarrassingly small these days) – Gagravarr Oct 26 '13 at 22:32
  • Can you add the stack trace of the exception to your question, so we can see if it's caused by `ExtractorFactory.createExtractor`, `extractor.getText()`, or the creation of the spreadsheet? – VGR Oct 27 '13 at 14:10
  • It is likely that your use of `-Xmx` is the only thing that made a difference. Java is very good at promptly garbage collecting unreferenced objects, and you will find [many](http://stackoverflow.com/questions/2414105/why-is-it-a-bad-practice-to-call-system-gc) [references](http://docs.oracle.com/cd/E19957-01/817-2180-10/pt_chap5.html#wp57022) that will tell you an explicit call to System.gc() is unnecessary and can even harm performance. – VGR Oct 28 '13 at 12:32
  • That's an interesting point VGR and I had heard that warning before. However, it did run properly once I added only the System.gc() calls. Perhaps this is a feature of the newer VMs? – RonanOD Oct 28 '13 at 13:31
  • It seems unlikely, since the JVM itself will do System.gc() whenever memory is critically low. – VGR Nov 02 '13 at 18:38

1 Answers1

1

Thanks to the advice in the comments, I was able to solve this by removing the unnecessary String object creation with every row. I was able to resolve this anyway by putting System.gc() at the end of the main while loop. Also, I updated the VM arguments to give the application more runtime memory. I used the following settings: -d64 -Xms512m -Xmx4g. Finally, I closed the extractor and file reader objects explicitly prior to excel creation.

Here is the updated code:

public class WordExtractor {
  public static void main(String[] args) {
    try {
      File inputFile = new File("table.docx");
      POITextExtractor extractor = ExtractorFactory.createExtractor(inputFile);
      String text = extractor.getText();
      BufferedReader reader = new BufferedReader(new StringReader(text));
      String line = null;
      boolean breakRead = false;
      int rowCount = 0;
      HSSFWorkbook workbook = new HSSFWorkbook();
      HSSFSheet sheet = workbook.createSheet("sheet1");
      while (!breakRead) {
        line = reader.readLine();
        if (line != null) {
          Row row = sheet.createRow(rowCount);
          StringTokenizer st = new StringTokenizer(line, "\t");
          int cellnum = 0;
          while (st.hasMoreTokens()) {
            Cell cell = row.createCell(cellnum++);
            String token = st.nextToken();
            cell.setCellValue(token);
          }
        } else {
          breakRead = true;
        }
        rowCount++;
        if (rowCount % 100 == 0) {
          // breakRead = true;
          System.gc();
        }
      }
      reader.close();
      extractor.close();
      System.gc();
       try {
       FileOutputStream out =
       new FileOutputStream(new File("new.xls"));
       workbook.write(out);
       out.close();
       System.out.println("Excel written successfully..");

       } catch (FileNotFoundException e) {
       e.printStackTrace();
       } catch (IOException e) {
       e.printStackTrace();
       }
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}
Deduplicator
  • 44,692
  • 7
  • 66
  • 118
RonanOD
  • 876
  • 1
  • 9
  • 19