3

My application creates a spreadsheet, this can be rather large (upto 500,000 rows, each row spread over six sheets, each with about 20 columns).

I use Apache Poi and moved to using SXSSFWorkBook as writes data to temporary files so that memory used is not proportional to size of the final spreadsheet and this works wells.

But the spreadsheet represents metadata changed, and when metadata has changed I want to add a comment to the cell showing the old value. I got this working for small spreadsheets but it always fails with heap memory errors when attempted for large files.

I am not sure if the problem is due to limitation of Poi storing all comments in memory or if I am doing it wrong. Worksheet is my own wrapper class, I only create one DrawingPatriach class per sheet, but it looks like I have to create an anchor for each comment I need.

private void addCellComment(Row r, Cell c, Worksheet sheet, String value)
{
    String formattedValue    =  value.replace('\u0000', '\n');
    int    rowCount         =  value.split("\\\\u000").length;
    ClientAnchor anchor = factory.createClientAnchor();
    anchor.setCol1(c.getColumnIndex());
    anchor.setCol2(c.getColumnIndex()+2);
    anchor.setRow1(r.getRowNum());
    anchor.setRow2(r.getRowNum()+rowCount);

    Drawing drawing = sheet.getDrawing();
    Comment comment = drawing.createCellComment(anchor);
    RichTextString str = factory.createRichTextString(formattedValue);
    comment.setString(str);
    c.setCellComment(comment);
} 
Mark Fitzgerald
  • 3,048
  • 3
  • 24
  • 29
Paul Taylor
  • 13,411
  • 42
  • 184
  • 351
  • 1
    I am not sure how we could help with this. In such situations, you might have to turn to a really good profiler and well, **profile** your application to understand exactly where and how and why memory is consumed. – GhostCat Mar 21 '17 at 08:48
  • How much memory it takes? what are your java memory settings? Which java do you use? How much memory your workstation has? – Krzysztof Cichocki Mar 21 '17 at 08:52
  • Im looking for someone with Poi expertise rather than memory profiling expertise. I allocate 1GB of memory, if I remove the create comment part I can create a spreadsheet of 20,000 rows and memory never hits 300mb, but add comments back in and it always fails on heap memory. Im using jre1.8.0_111 on Windows 10 – Paul Taylor Mar 21 '17 at 09:04
  • 2
    Please can you add more code i'm wondering how you initialize SXSSFWorkbook what is the window size? Have you turn off auto-flushing and using manually control how rows are flushed to disk? – Saulius Next Mar 23 '17 at 12:56
  • @SauliusNext i just do workbook = new SXSSFWorkbook(1000) - I havent turned off flushing, if I don't add commentcells then it works fine using substantialy less memory than if I just used HSSFWorkbook.the issue is specific to comment cells – Paul Taylor Mar 24 '17 at 08:51
  • did you try to add just a few 1000 comments, close the file, reopen it and proceed? Does this improve memory behavior? – Christoph Bimminger Mar 25 '17 at 15:29
  • @ChristophBimminger i dont think it is possible to proceed is it, once I have saved the file doesnt reopening the file mean all the existing rows are now loaded into memory ? – Paul Taylor Mar 26 '17 at 13:07
  • @PaulTaylor could you help me to solve this problem, please ? https://stackoverflow.com/questions/58783190/how-to-read-large-excel-file-in-android-app – Tarek Jan 13 '20 at 13:18

1 Answers1

3

Well, from Apache POI documentation (emphasis mine):

Please note that there are still things that still may consume a large amount of memory based on which features you are using, e.g. merged regions, hyperlinks, comments, ... are still only stored in memory and thus may require a lot of memory if used extensively.

So I guess it is a limitation of Apache POI, and you will have to work around that.

gpeche
  • 21,974
  • 5
  • 38
  • 51
  • Well spotted, okay that has confirmed the issue, I had already raised a bug https://bz.apache.org/bugzilla/show_bug.cgi?id=60894 suspecting this was the case. What I dont have is a workaround, I cannot see a way round it. – Paul Taylor Mar 27 '17 at 07:39
  • Personally I would give up trying to keep that info as comments: In .xlsx format, comments are stored "out-of-band", so POI needs to keep it in memory until each sheet is completed. As .xlsx is basically a bunch of zipped XML files, you can also consider generating the workbook without comments and then add a second pass that unzips, modifies the XML to add the comments and re-zips. Follow these links for more info about this approach: https://msdn.microsoft.com/en-us/library/office/gg278316.aspx https://msdn.microsoft.com/en-us/library/dd979921(v=office.12).aspx – gpeche Mar 28 '17 at 07:44
  • Good idea (at least in principle might take some work) did not realize it was xml, with that in mind would it also be possible to save in stages as per @ChristophBimminger and reopen the file and append to it without reading all the data in using poi or not as that would be simpler – Paul Taylor Mar 28 '17 at 08:40
  • what you mean by 'out-of-band' – Paul Taylor Mar 28 '17 at 08:41
  • 1
    @PaulTaylor https://en.wikipedia.org/wiki/Out-of-band_data In this case, if you inspect a .xlsx you will see that comments are not written to a sheetXXX.xml file, but to a different file, commentXXX.xml. The current POI implementation is not able to "stream" comments to that file, but instead keeps them and writes the whole commentXXX.xml file in one go at sheet completion time. – gpeche Mar 28 '17 at 21:48
  • Okay sound sounds like a plan when I feel brave, shame not been tackled by ApachePoi seems a fairly significant omission. – Paul Taylor Mar 29 '17 at 14:43