Parsing xlsx files in chunks with a streaming/pagination strategy using Apache POI

Question

We have a case where xlsx and xlsm files with a large amount of data (on the order of 80-100 MB) cause out-of-memory (heap space) errors on our servers when loaded with the load() method of the Workbook object, which takes a FileInputStream as its parameter.

The intent is to load the data, validate the cell contents, and report an error if any record is invalid. If all the data is valid, it is written to the table. Hence, the following did not serve my purpose:

Error While Reading Large Excel Files (xlsx) Via Apache POI

The problem involves paginated parsing, data validation, and then writing to the database.

score 2 · Answer 1 · answered Sep 24 '15 at 14:43

Since xlsx files are zip archives containing XML content, you can remove pages by simple parsing and discarding, producing a smaller content XML. Then create a smaller xlsx from it and process that with Apache POI. Use a test xlsx to develop the parsing. The XML generally has no line breaks or indentation, so an XML beautifier or tree editor might help. Excel uses shared strings, so the actual content is hard to see.

Use a zip file system (URLs "jar:file://... .xlsx") to operate on the xlsx.
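The zip file system idea can be sketched with the JDK alone. This is a minimal sketch, assuming the usual xlsx layout with worksheet parts under /xl/worksheets; the class and method names (XlsxZipView, listSheetParts) are hypothetical and only serve the illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper: views an xlsx as a zip file system via a "jar:file:..." URI.
final class XlsxZipView {

    /** Returns the sorted paths of the worksheet XML parts inside the xlsx package. */
    static List<String> listSheetParts(Path xlsx) throws IOException {
        // "jar:" + file URI selects the JDK's built-in zip file system provider.
        URI uri = URI.create("jar:" + xlsx.toUri());
        try (FileSystem zip = FileSystems.newFileSystem(uri, Collections.<String, Object>emptyMap())) {
            // Assumption: worksheets live under /xl/worksheets, as in typical xlsx files.
            Path dir = zip.getPath("/xl/worksheets");
            List<String> parts = new ArrayList<>();
            try (Stream<Path> entries = Files.list(dir)) {
                entries.filter(p -> p.toString().endsWith(".xml"))
                       .forEach(p -> parts.add(p.toString()));
            }
            Collections.sort(parts);
            return parts;
        }
    }
}
```

Inside the try block, individual parts can be opened with Files.newInputStream(zip.getPath(...)) and read or rewritten one at a time, so the whole workbook never has to be held in memory.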

Thank you. I'll try this in divide and conquer fashion. – Devanshu Kashyap, Sep 25 '15 at 09:17

score 0 · Accepted Answer · answered Nov 24 '15 at 11:36

A StAX parser is a good approach for this situation: https://docs.oracle.com/javase/tutorial/jaxp/stax/index.html

We can iterate over the sheets to obtain the shared-string index stored in each cell, then use the SharedStringsTable object to look up the actual value for that cell.
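The approach above can be sketched roughly as follows. To stay self-contained, this sketch uses only the JDK's StAX API and keeps the shared strings in a plain list, which stands in for the role POI's SharedStringsTable plays. The class and method names (SheetStaxReader, readSharedStrings, streamCells) are hypothetical, and the sketch assumes plain-text shared strings and ignores inline strings, formulas, and styles.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams cells from sheet XML without building a workbook in memory.
final class SheetStaxReader {

    private static XMLStreamReader open(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTDs expected in xlsx parts
        return factory.createXMLStreamReader(in);
    }

    /** Reads xl/sharedStrings.xml; list index i holds the text of the i-th <si> entry. */
    static List<String> readSharedStrings(InputStream in) throws XMLStreamException {
        XMLStreamReader r = open(in);
        List<String> strings = new ArrayList<>();
        StringBuilder current = null;
        boolean inText = false;
        while (r.hasNext()) {
            int ev = r.next();
            if (ev == XMLStreamConstants.START_ELEMENT) {
                if ("si".equals(r.getLocalName())) current = new StringBuilder();
                else if ("t".equals(r.getLocalName())) inText = true;
            } else if ((ev == XMLStreamConstants.CHARACTERS || ev == XMLStreamConstants.CDATA)
                    && inText && current != null) {
                current.append(r.getText());
            } else if (ev == XMLStreamConstants.END_ELEMENT) {
                if ("t".equals(r.getLocalName())) inText = false;
                else if ("si".equals(r.getLocalName()) && current != null) {
                    strings.add(current.toString());
                    current = null;
                }
            }
        }
        r.close();
        return strings;
    }

    /** Calls onCell(cellReference, value) for each cell that has a <v> element. */
    static void streamCells(InputStream sheet, List<String> shared,
                            BiConsumer<String, String> onCell) throws XMLStreamException {
        XMLStreamReader r = open(sheet);
        String ref = null;
        String type = null;
        StringBuilder value = null;
        while (r.hasNext()) {
            int ev = r.next();
            if (ev == XMLStreamConstants.START_ELEMENT) {
                if ("c".equals(r.getLocalName())) {
                    ref = r.getAttributeValue(null, "r");  // e.g. "B7"
                    type = r.getAttributeValue(null, "t"); // "s" marks a shared-string index
                } else if ("v".equals(r.getLocalName())) {
                    value = new StringBuilder();
                }
            } else if ((ev == XMLStreamConstants.CHARACTERS || ev == XMLStreamConstants.CDATA)
                    && value != null) {
                value.append(r.getText());
            } else if (ev == XMLStreamConstants.END_ELEMENT
                    && "v".equals(r.getLocalName()) && value != null) {
                String raw = value.toString();
                value = null;
                onCell.accept(ref, "s".equals(type) ? shared.get(Integer.parseInt(raw)) : raw);
            }
        }
        r.close();
    }
}
```

Because cells arrive one at a time, the callback can validate each value and collect rows into fixed-size batches before writing them to the database, which gives the paginated behaviour the question asks for.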
