How does the POI Event API read data from Excel and why does it use less RAM?

Question

I am currently writing my bachelor thesis and I am using the POI Event API from Apache. In short, my work is about a more efficient way to read data from Excel.

I get asked by developers again and again how exactly this is meant with Event API. Unfortunately I don't find anything on the Apache page about the basic principle.

Following code, how I use the POI Event API (This is from the Apache example for XSSF and SAX):

import java.io.InputStream;
import java.util.Iterator;

import org.apache.poi.ooxml.util.SAXHelper;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.ParserConfigurationException;

public class ExampleEventUserModel {
    public void processOneSheet(String filename) throws Exception {
        OPCPackage pkg = OPCPackage.open(filename);
        XSSFReader r = new XSSFReader( pkg );
        SharedStringsTable sst = r.getSharedStringsTable();

        XMLReader parser = fetchSheetParser(sst);

        // To look up the Sheet Name / Sheet Order / rID,
        //  you need to process the core Workbook stream.
        // Normally it's of the form rId# or rSheet#
        InputStream sheet2 = r.getSheet("rId2");
        InputSource sheetSource = new InputSource(sheet2);
        parser.parse(sheetSource);
        sheet2.close();
    }

    public void processAllSheets(String filename) throws Exception {
        OPCPackage pkg = OPCPackage.open(filename);
        XSSFReader r = new XSSFReader( pkg );
        SharedStringsTable sst = r.getSharedStringsTable();

        XMLReader parser = fetchSheetParser(sst);

        Iterator<InputStream> sheets = r.getSheetsData();
        while(sheets.hasNext()) {
            System.out.println("Processing new sheet:\n");
            InputStream sheet = sheets.next();
            InputSource sheetSource = new InputSource(sheet);
            parser.parse(sheetSource);
            sheet.close();
            System.out.println("");
        }
    }

    public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException, ParserConfigurationException {
        XMLReader parser = SAXHelper.newXMLReader();
        ContentHandler handler = new SheetHandler(sst);
        parser.setContentHandler(handler);
        return parser;
    }

    /**
     * See org.xml.sax.helpers.DefaultHandler javadocs
     */
    private static class SheetHandler extends DefaultHandler {
        private SharedStringsTable sst;
        private String lastContents;
        private boolean nextIsString;

        private SheetHandler(SharedStringsTable sst) {
            this.sst = sst;
        }

        public void startElement(String uri, String localName, String name,
                                 Attributes attributes) throws SAXException {
            // c => cell
            if(name.equals("c")) {
                // Print the cell reference
                System.out.print(attributes.getValue("r") + " - ");
                // Figure out if the value is an index in the SST
                String cellType = attributes.getValue("t");
                if(cellType != null && cellType.equals("s")) {
                    nextIsString = true;
                } else {
                    nextIsString = false;
                }
            }
            // Clear contents cache
            lastContents = "";
        }

        public void endElement(String uri, String localName, String name)
                throws SAXException {
            // Process the last contents as required.
            // Do now, as characters() may be called more than once
            if(nextIsString) {
                int idx = Integer.parseInt(lastContents);
                lastContents = sst.getItemAt(idx).getString();
                nextIsString = false;
            }

            // v => contents of a cell
            // Output after we've seen the string contents
            if(name.equals("v")) {
                System.out.println(lastContents);
            }
        }

        public void characters(char[] ch, int start, int length) {
            lastContents += new String(ch, start, length);
        }
    }

    public static void main(String[] args) throws Exception {
        ExampleEventUserModel example = new ExampleEventUserModel();
        example.processOneSheet(args[0]);
        example.processAllSheets(args[0]);
    }
}

Can someone please explain to me how the Event API works? Is it the same as the event-based architecture or is it something else?

Did you read the intro at http://poi.apache.org/components/spreadsheet/how-to.html#xssf_sax_api? — Gagravarr, Jan 30 '19 at 13:33

Axel Richter · Accepted Answer · 2019-01-31T07:09:01.643

A *.xlsx file, which is Excel stored in Office Open XML and is what apache poi handles as XSSF, is a ZIP archive containing the data in XML files within a directory structure. So we can unzip the *.xlsx file and get the data directly from the XML files then.

There is /xl/sharedStrings.xml having all the string cell values in it. And there is /xl/workbook.xml describing the workbook structure. And there are /xl/worksheets/sheet1.xml, /xl/worksheets/sheet2.xml, ... which are storing the sheets' data. And there is /xl/styles.xml having the style settings for all cells in the sheets.

Per default while creating a XSSFWorkbook all those parts of the *.xlsx file will become object representations as XSSFWorkbook, XSSFSheet, XSSFRow, XSSFCell, ... and further objects of org.apache.poi.xssf.*.* in memory.

To get an impression of how memory consuming XSSFSheet, XSSFRow and XSSFCell are, a look into the sources will be good. Each of those objects contains multiple Lists and Maps as internally members and of course multiple methods too. Now imagine a sheet having hundreds of thousands of rows each containing up to hundreds of cells. Each of those rows and cells will be represented by a XSSFRow or a XSSFCell in memory. This cannot be an accusation to apache poi because those objects are necessary if working with those objects is needed. But if the need is really only getting the content out of the Excel sheet, then those objects are not all necessary. That's why the XSSF and SAX (Event API) approach.

So if the need is only reading data from sheets one could simply parsing the XML of all the /xl/worksheets/sheet[n].xml files without the need for creating memory consuming objects for each sheet and for each row and for each cell in those sheets.

Parsing XML in event based mode means that the code goes top down through the XML and has callback methods defined which get called if the code detects the start of an element, the end of an element or character content within an element. The appropriate callback methods then handle what to do on start, end or with character content of an element. So reading the XML file only means running top down through the file once, handle the events (start, end, character content of an element) and are able getting all needed content out of it. So memory consuming is reduced to storing the text data gotten from the XML.

XSSF and SAX (Event API) uses class SheetHandler which extends DefaultHandler for this.

But if we are already at this level where we get at the underlying XML data and process it, then we could go one more step back too. Native Java is able handling ZIP and parsing XML. So we would not even need additional libraries at all. See how read excel file having more than 100000 row in java? where I have shown this. My code uses Package javax.xml.stream which also provides using event based XMLEventReader but not using callbacks but linear code. Maybe this code is simpler to understand because it is all in one.

For detecting whether a number format is a date format, and so the formatted cell contains a date / time value, one single apache poi class org.apache.poi.ss.usermodel.DateUtil is used. This is done to simplify the code. Of course even this class we could have coded our self.

How does the POI Event API read data from Excel and why does it use less RAM?

1 Answers1