57

I have a large .xlsx file (141 MB, containing 293413 lines with 62 columns each) I need to perform some operations within.

I am having problems with loading this file (OutOfMemoryError), as POI has a large memory footprint on XSSF (xlsx) workbooks.

This SO question is similar, and the solution presented is to increase the VM's allocated/maximum memory.

It seems to work for that kind of file-size (9MB), but for me, it just simply doesn't work even if a allocate all available system memory. (Well, it's no surprise considering the file is over 15 times larger)

I'd like to know if there is any way to load the workbook in a way it won't consume all the memory, and yet, without doing the processing based (going into) the XSSF's underlying XML. (In other words, maintaining a puritan POI solution)

If there isn't tough, you are welcome to say it ("There isn't.") and point me the ways to a "XML" solution.

Community
  • 1
  • 1
CosmicGiant
  • 6,275
  • 5
  • 43
  • 58
  • 1
    I think the SO question more relavent to what your asking is "Processing large xlsx file in Java". Try this API http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api as I think it just parses the data, it doesn't store the stuff. – Bob Kuhar Aug 09 '12 at 21:50
  • Do you just need to read the file in to work with it? Or do you need to do a full read / edit / write cycle? – Gagravarr Nov 10 '12 at 21:47
  • @Gagravarr - I theoretically need a full cycle, but as far as POI's scope goes, simply reading is enough, since for this specific case the final file can be saved in .txt(tab-separated), .csv, or similar; In other words, plain-text that I can manage without POI after the data has been extracted. – CosmicGiant Nov 12 '12 at 11:38

8 Answers8

78

I was in a similar situation with a webserver environment. The typical size of the uploads were ~150k rows and it wouldn't have been good to consume a ton of memory from a single request. The Apache POI Streaming API works well for this, but it requires a total redesign of your read logic. I already had a bunch of read logic using the standard API that I didn't want to have to redo, so I wrote this instead: https://github.com/monitorjbl/excel-streaming-reader

It's not entirely a drop-in replacement for the standard XSSFWorkbook class, but if you're just iterating through rows it behaves similarly:

import com.monitorjbl.xlsx.StreamingReader;

InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
StreamingReader reader = StreamingReader.builder()
        .rowCacheSize(100)    // number of rows to keep in memory (defaults to 10)
        .bufferSize(4096)     // buffer size to use when reading InputStream to file (defaults to 1024)
        .sheetIndex(0)        // index of sheet to use (defaults to 0)
        .read(is);            // InputStream or File for XLSX file (required)

for (Row r : reader) {
  for (Cell c : r) {
    System.out.println(c.getStringCellValue());
  }
}     

There are some caveats to using it; due to the way XLSX sheets are structured, not all data is available in the current window of the stream. However, if you're just trying to read simple data out from the cells, it works pretty well for that.

monitorjbl
  • 4,280
  • 3
  • 36
  • 45
  • Goes right except for empty cell, which are not returned and that caused to me a problem because I'm not able to identify which column was really missing on each row ( only I know have been returned less than expected ) – darkman97i Mar 03 '16 at 17:42
  • 2
    This solution is great. On the git page there is a new usage defined: `InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx")); Workbook workbook = StreamingReader.builder().rowCacheSize(100).bufferSize(4096).sheetIndex(0).open(is);` If all you need to do is iterate over the rows, this allows you to treat the worksheet exactly the same as if it was loaded in memory. – Anthony Ruffino Jul 28 '16 at 16:54
  • @darkman97i: `Cell cell = row.getCell(i, Row.CREATE_NULL_AS_BLANK)` is now supported and resolves the issue. See [here](https://github.com/monitorjbl/excel-streaming-reader/issues/33) – themenace Mar 13 '17 at 16:00
  • 1
    This solution work well for reading the file content but does not work if you want to get the cell styles from the workbook. – Arthur bauer May 15 '17 at 14:52
  • It does not support .xls files when using the library. Do you have any suggestions? Thank you! – nguyenngoc101 May 13 '19 at 08:38
  • How does this work with Formula cells? I'm not sure if the result of the formula is stored in the cell's data in addition to the formula, or if it's calculated as you read through them. – DFW Jul 30 '20 at 18:19
  • You just saved my job (kidding!). It works so well. I tried everything and nothing worked. Just wished formula cells could be handled as well. For now i've requested clients not to upload files with formulas. – saran3h Mar 22 '21 at 16:00
20

A improvement in memory usage can be done by using a File instead of a Stream. (It is better to use a streaming API, but the Streaming API's have limitations, see http://poi.apache.org/spreadsheet/index.html)

So instead of

Workbook workbook = WorkbookFactory.create(inputStream);

do

Workbook workbook = WorkbookFactory.create(new File("yourfile.xlsx"));

This is according to : http://poi.apache.org/spreadsheet/quick-guide.html#FileInputStream

Files vs InputStreams

"When opening a workbook, either a .xls HSSFWorkbook, or a .xlsx XSSFWorkbook, the Workbook can be loaded from either a File or an InputStream. Using a File object allows for lower memory consumption, while an InputStream requires more memory as it has to buffer the whole file."

rjdkolb
  • 10,377
  • 11
  • 69
  • 89
11

The Excel support in Apache POI, HSSF and XSSF, supports 3 different modes.

One is a full, DOM-Like in-memory "UserModel", which supports both reading and writing. Using the common SS (SpreadSheet) interfaces, you can code for both HSSF (.xls) and XSSF (.xlsx) basically transparently. However, it needs lots of memory.

POI also supports a streaming read-only way to process the files, the EventModel. This is much more low-level than the UserModel, and gets you very close to the file format. For HSSF (.xls) you get a stream of records, and optionally some help with handling them (missing cells, format tracking etc). For XSSF (.xlsx) you get streams of SAX events from the different parts of the file, with help to get the right part of the file and also easy processing of common but small bits of the file.

For XSSF (.xlsx) only, POI also supports a write-only streaming write, suitable for low level but low memory writing. It largely just supports new files though (certain kinds of append are possible). There is no HSSF equivalent, and due to back-and-forth byte offsets and index offsets in many records it would be pretty hard to do...

For your specific case, as described in your clarifying comments, I think you'll want to use the XSSF EventModel code. See the POI documentation to get started, then try looking at these three classes in POI and Tika which use it for more details.

Gagravarr
  • 47,320
  • 10
  • 111
  • 156
9

POI now includes an API for these cases. SXSSF http://poi.apache.org/spreadsheet/index.html It does not load everything on memory so it could allow you to handle such file.

Note: I have read that SXSSF works as a writing API. Loading should be done using XSSF without inputstream'ing the file (to avoid a full load of it in memory)

Alfabravo
  • 7,493
  • 6
  • 46
  • 82
  • 1
    Can you give me any sources on a way to implement loading an existing file? All i see on that link is for creating purposes, which is not of use in my case. – CosmicGiant Aug 13 '12 at 15:46
  • 4
    I have read that SXSSF works as a writing API. Loading should be done using XSSF without inputstream'ing the file (to avoid a full load of it in memory) – Alfabravo Aug 13 '12 at 16:59
  • How can I load it without ImputStream'ing the file then? The load (read) is the problem here, not the save (write). – CosmicGiant Aug 13 '12 at 17:32
  • 1
    That could be solved reading this thread. Hopefully http://stackoverflow.com/questions/11154678/xssfworkbook-takes-a-lot-of-time-to-load – Alfabravo Aug 13 '12 at 17:47
  • Nope. The following code still results in `OutOfMemoryError` with even with `-Xmx2048m`: `OPCPackage opcPackage = OPCPackage.open(file.getAbsolutePath()); workbook = new XSSFWorkbook(opcPackage);` --- There is still the event-based API (`XSSF_SAX`) to try, but I'm unsure how to do it. – CosmicGiant Aug 13 '12 at 19:59
  • @Panther yes, I added that to a comment right here, some time ago. Might add it to the actual answer to make it clearer. Thanks. – Alfabravo Dec 17 '15 at 22:01
8

Check this post. I show how to use SAX parser to process an XLSX file.

https://stackoverflow.com/a/44969009/4587961

In short, I extended org.xml.sax.helpers.DefaultHandler whih processes XML structure for XLSX filez. t is event parser - SAX.

class SheetHandler extends DefaultHandler {

    private static final String ROW_EVENT = "row";
    private static final String CELL_EVENT = "c";

    private SharedStringsTable sst;
    private String lastContents;
    private boolean nextIsString;

    private List<String> cellCache = new LinkedList<>();
    private List<String[]> rowCache = new LinkedList<>();

    private SheetHandler(SharedStringsTable sst) {
        this.sst = sst;
    }

    public void startElement(String uri, String localName, String name,
                             Attributes attributes) throws SAXException {
        // c => cell
        if (CELL_EVENT.equals(name)) {
            String cellType = attributes.getValue("t");
            if(cellType != null && cellType.equals("s")) {
                nextIsString = true;
            } else {
                nextIsString = false;
            }
        } else if (ROW_EVENT.equals(name)) {
            if (!cellCache.isEmpty()) {
                rowCache.add(cellCache.toArray(new String[cellCache.size()]));
            }
            cellCache.clear();
        }

        // Clear contents cache
        lastContents = "";
    }

    public void endElement(String uri, String localName, String name)
            throws SAXException {
        // Process the last contents as required.
        // Do now, as characters() may be called more than once
        if(nextIsString) {
            int idx = Integer.parseInt(lastContents);
            lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
            nextIsString = false;
        }

        // v => contents of a cell
        // Output after we've seen the string contents
        if(name.equals("v")) {
            cellCache.add(lastContents);
        }
    }

    public void characters(char[] ch, int start, int length)
            throws SAXException {
        lastContents += new String(ch, start, length);
    }

    public List<String[]> getRowCache() {
        return rowCache;
    }
}

And then I parse the XML presending XLSX file

private List<String []> processFirstSheet(String filename) throws Exception {
    OPCPackage pkg = OPCPackage.open(filename, PackageAccess.READ);
    XSSFReader r = new XSSFReader(pkg);
    SharedStringsTable sst = r.getSharedStringsTable();

    SheetHandler handler = new SheetHandler(sst);
    XMLReader parser = fetchSheetParser(handler);
    Iterator<InputStream> sheetIterator = r.getSheetsData();

    if (!sheetIterator.hasNext()) {
        return Collections.emptyList();
    }

    InputStream sheetInputStream = sheetIterator.next();
    BufferedInputStream bisSheet = new BufferedInputStream(sheetInputStream);
    InputSource sheetSource = new InputSource(bisSheet);
    parser.parse(sheetSource);
    List<String []> res = handler.getRowCache();
    bisSheet.close();
    return res;
}

public XMLReader fetchSheetParser(ContentHandler handler) throws SAXException {
    XMLReader parser = new SAXParser();
    parser.setContentHandler(handler);
    return parser;
}
Yan Khonski
  • 12,225
  • 15
  • 76
  • 114
  • 2
    where is your fetchSheetParser? – abr Mar 16 '18 at 12:22
  • @abr Edited, check again. – Yan Khonski Mar 16 '18 at 12:27
  • thanks, working as expected. Question: I'm using your code to read an extremly long xlsx file and happens that it gas 60 columns. However, each line gets returned 48 and some 59 (aparently it's ignoring the empty values on cells and not placing them in the array list). Tried to figure out how to output it on the startelement but so far, no success. Have any idea? – abr Mar 16 '18 at 12:36
  • Currently, it ignores empty cell. This is a homework for you - to fix this bug. When I have time, I will fix it myself and update the code. – Yan Khonski Mar 16 '18 at 12:45
  • Will try, if I manage the work-around, I'll send an edit request to your post :) tag me if you manage it before I do – abr Mar 16 '18 at 12:48
  • 1
    But how to get cell formula while reding using handler if using your code can you suggest me @YanKhonski – Umesh Sonawane Nov 29 '19 at 13:07
  • The answer in the link of this post has been updated to handle all rows and empty cells. – DaviM Apr 06 '21 at 18:00
0

Based on monitorjbl's answer and test suite explored from poi, following worked for me on multi-sheet xlsx file with 200K records (size > 50 MB):

import com.monitorjbl.xlsx.StreamingReader;
. . .
try (
        InputStream is = new FileInputStream(new File("sample.xlsx"));
        Workbook workbook = StreamingReader.builder().open(is);
) {
    DataFormatter dataFormatter = new DataFormatter();
    for (Sheet sheet : workbook) {
        System.out.println("Processing sheet: " + sheet.getSheetName());
        for (Row row : sheet) {
            for (Cell cell : row) {
                String value = dataFormatter.formatCellValue(cell);
            }
        }
    }
}
Azhar Khan
  • 3,829
  • 11
  • 26
  • 32
0

For latest code use this


InputStream file = new FileInputStream(
                    new File("uploads/" + request.getSession().getAttribute("username") + "/" + userFile));
Workbook workbook = StreamingReader.builder().rowCacheSize(100) // number of rows to keep in memory
                    .bufferSize(4096) // index of sheet to use (defaults to 0)
                    .open(file); // InputStream or File for XLSX file (required)

Iterator<Row> rowIterator = workbook.getSheetAt(0).rowIterator();
 while (rowIterator.hasNext()) {
     while (cellIterator.hasNext()) {
         Cell cell = cellIterator.next();
        String cellValue = dataFormatter.formatCellValue(cell);
     }}

-1

You can use SXXSF instead of using HSSF. I could generate excel with 200000 rows.