4

I'm trying to write a pretty large XLSX file (4M+ cells) and I'm having some memory issues.

I can't use SXSSF since I also need to read the existing cells in the template.

Is there anything I can do to reduce the memory footprint?
Perhaps combine streaming reading and streaming writing?

Nicklas A.
  • 6,501
  • 7
  • 40
  • 65
  • How much formatting / fancy stuff does your file have? And is there any chance you could arrange it so that new rows only need to go into new sheets? – Gagravarr Oct 05 '13 at 13:41
  • How of what we're writing is new cells and rows but we have to mark formulas as dirty or excel will display its own value. Perhaps it's worth looking into splitting it up into creation and updating and using SXSSF for creating only. – Nicklas A. Oct 07 '13 at 11:29

2 Answers2

5

To handle large data with low memory, the best and I think the only option is SXSSF api-s. If you need to read some data of the existing cells, I assume you do not need the entire 4M+ at the same time. In such a case based on your application requirement, you can handle the window size yourself and keep in memory only the amount of data you need at a particular time. You can start by looking at the example at : http://poi.apache.org/spreadsheet/how-to.html#sxssf

Something as

SXSSFWorkbook wb = new SXSSFWorkbook(-1); // turn off auto-flushing and accumulate all rows in memory
// manually control how rows are flushed to disk 
if(rownum % NOR == 0) {
((SXSSFSheet)sh).flushRows(NOR); // retain NOR last rows and flush all others

Hope this helps.

ParoTech
  • 347
  • 2
  • 15
  • The problem with SXSSF is that you can't read stuff at all. You can access stuff you've already written but you can't access stuff that's already in the workbook. – Nicklas A. Oct 07 '13 at 11:30
  • For reading use the Event Model, a good sample is given at http://poi.apache.org/spreadsheet/how-to.html - XSSF and SAX (Event API) I have used it myself, it is a bit complicated but serves the purpose of reading huge data without running out of memory. You can also look at the small samples at http://poi.apache.org/spreadsheet/quick-guide.html to get an idea. – ParoTech Oct 07 '13 at 14:49
  • But the thing is that we need to read a cell and then write that same cell again which will be tricky. – Nicklas A. Oct 08 '13 at 14:26
2

I used SAX parser to process events of the XML document presentation. This is

import com.sun.org.apache.xerces.internal.parsers.SAXParser;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

import java.io.BufferedInputStream;
import java.io.InputStream;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;


public class LowMemoryExcelFileReader {

    private String file;

    public LowMemoryExcelFileReader(String file) {
        this.file = file;
    }

    public List<String[]> read() {
        try {
            return processFirstSheet(file);
        } catch (Exception e) {
           throw new RuntimeException(e);
        }
    }

    private List<String []> readSheet(Sheet sheet) {
        List<String []> res = new LinkedList<>();
        Iterator<Row> rowIterator = sheet.rowIterator();

        while (rowIterator.hasNext()) {
            Row row = rowIterator.next();
            int cellsNumber = row.getLastCellNum();
            String [] cellsValues = new String[cellsNumber];

            Iterator<Cell> cellIterator = row.cellIterator();
            int cellIndex = 0;

            while (cellIterator.hasNext()) {
                Cell cell = cellIterator.next();
                cellsValues[cellIndex++] = cell.getStringCellValue();
            }

            res.add(cellsValues);
        }
        return res;
    }

    public String getFile() {
        return file;
    }

    public void setFile(String file) {
        this.file = file;
    }

    private List<String []> processFirstSheet(String filename) throws Exception {
        OPCPackage pkg = OPCPackage.open(filename, PackageAccess.READ);
        XSSFReader r = new XSSFReader(pkg);
        SharedStringsTable sst = r.getSharedStringsTable();

        SheetHandler handler = new SheetHandler(sst);
        XMLReader parser = fetchSheetParser(handler);
        Iterator<InputStream> sheetIterator = r.getSheetsData();

        if (!sheetIterator.hasNext()) {
            return Collections.emptyList();
        }

        InputStream sheetInputStream = sheetIterator.next();
        BufferedInputStream bisSheet = new BufferedInputStream(sheetInputStream);
        InputSource sheetSource = new InputSource(bisSheet);
        parser.parse(sheetSource);
        List<String []> res = handler.getRowCache();
        bisSheet.close();
        return res;
    }

    public XMLReader fetchSheetParser(ContentHandler handler) throws SAXException {
        XMLReader parser = new SAXParser();
        parser.setContentHandler(handler);
        return parser;
    }

    /**
     * See org.xml.sax.helpers.DefaultHandler javadocs
     */
    private static class SheetHandler extends DefaultHandler {

        private static final String ROW_EVENT = "row";
        private static final String CELL_EVENT = "c";

        private SharedStringsTable sst;
        private String lastContents;
        private boolean nextIsString;

        private List<String> cellCache = new LinkedList<>();
        private List<String[]> rowCache = new LinkedList<>();

        private SheetHandler(SharedStringsTable sst) {
            this.sst = sst;
        }

        public void startElement(String uri, String localName, String name,
                                 Attributes attributes) throws SAXException {
            // c => cell
            if (CELL_EVENT.equals(name)) {
                String cellType = attributes.getValue("t");
                if(cellType != null && cellType.equals("s")) {
                    nextIsString = true;
                } else {
                    nextIsString = false;
                }
            } else if (ROW_EVENT.equals(name)) {
                if (!cellCache.isEmpty()) {
                    rowCache.add(cellCache.toArray(new String[cellCache.size()]));
                }
                cellCache.clear();
            }

            // Clear contents cache
            lastContents = "";
        }

        public void endElement(String uri, String localName, String name)
                throws SAXException {
            // Process the last contents as required.
            // Do now, as characters() may be called more than once
            if(nextIsString) {
                int idx = Integer.parseInt(lastContents);
                lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
                nextIsString = false;
            }

            // v => contents of a cell
            // Output after we've seen the string contents
            if(name.equals("v")) {
                cellCache.add(lastContents);
            }
        }

        public void characters(char[] ch, int start, int length)
                throws SAXException {
            lastContents += new String(ch, start, length);
        }

        public List<String[]> getRowCache() {
            return rowCache;
        }
    }
}
Yan Khonski
  • 12,225
  • 15
  • 76
  • 114
  • 1
    I tried your code and I happened to notice that it doesn't read the last row, are you aware of this? – Paolo Forgia Jul 20 '18 at 07:17
  • Looks like to handle empty cells, a cell must have a style. If you check generated xml file from excel. If it does contain a style, it appears inside xml file. Empty cell without data tag (enough to use the code above) – DaviM Apr 07 '21 at 13:18
  • Feel free to update the code, it was long time when I used it. Maybe the latest version of the library can solve this issue or provides an alternative way to solve this problem. – Yan Khonski Apr 07 '21 at 21:31