2

Apache POI does solve this but the total size of all the transitive dependencies of apache poi is close to 13MB.

  • poi-ooxml-schemas-3.17.jar (6MB)
  • xmlbeans-2.6.0.jar (3MB)
  • poi-3.17.jar (3MB)
  • poi-ooxml-3.17.jar (1.5MB)

So far I could find questions about writing to xlsx without apache poi and libraries which are not maintained. Any pointers would be very helpful.

Vivek Aditya
  • 1,145
  • 17
  • 46
  • Downvoters , please leave a comment. – Vivek Aditya Sep 10 '18 at 16:24
  • 1
    https://stackoverflow.com/questions/51571719/how-read-excel-file-having-more-than-100000-row-in-java/51582121#51582121 – Axel Richter Sep 10 '18 at 16:29
  • 1
    The problem is simply the format is really complex...and within the files it's xml... What is the problem with 13 MiB ? – khmarbaise Sep 10 '18 at 17:02
  • What's wrong with 13mb of jars? That's several orders of magnitude smaller than a great many projects I know of! – Gagravarr Sep 10 '18 at 20:59
  • @khmarbaise & Gagravarr Apache POI only is contributing to 50% of my build size. Just for reading a xlsx file this is a huge cost. – Vivek Aditya Sep 11 '18 at 06:48
  • @AxelRichter Will look into the code you pointed, thanks. – Vivek Aditya Sep 11 '18 at 06:48
  • @Vivek Aditya: Whether you can using plain Java instead of `apache poi` depends on what you are meaning by "just reading a xlsx file". A `*.xlsx` file may contain much more than simply textual or numeric data. There could be charts, pictures and other shapes, ActiveX or form controls, conditional formatting, pivot tables, ... To get all that using plain Java without using `apache poi` will be possible but also will be like reinventing the wheel then. But you are right, the module structure and division into different Jars of `apache poi` is really not optimal. – Axel Richter Sep 11 '18 at 07:10
  • 1
    The question is what kind of functionality you exactly need? But unfortunately you haven't told use.....And if the size is too large for you you might help to reduce the size (contribute to the project) ...the other options is to use a particular library which does what you need and maybe not more than you need ... – khmarbaise Sep 11 '18 at 08:21
  • 1
    @khmarbaise: To reduce the size cannot be the goal. The size will increase since `apache poi` is long nor not finished yet. But it could be more modular. We would need modules for working with the file systems, one for binary (OLE2) and one for OOXML and then single modules for each Office part (Word, Excel, Powerpoint) in each shaping (binary or OOXML). Maybe connected by interfaces for using both, binary and OOXML and/or across parts, which should be also modular though. But this is nothing what contributors could decise. This are decisions of the project leaders. – Axel Richter Sep 11 '18 at 09:00
  • Hm...in an Apache project there are no leaders cause the community is the lead ...Apart from that yes of course modularization can be the right direction but checking if other dependencies can be replaced maybe with other (smaller once) or if the dependencies are only used for very limited things...etc. that can reduce the size sometimes...and why can't be the size of a library not a goal ? – khmarbaise Sep 11 '18 at 09:18
  • @AxelRichter Agreed. My use case is straightforward , just read the content of xlsx file ( no charts , pictures or anything other than text ). – Vivek Aditya Sep 11 '18 at 10:40
  • @khmarbaise I just need to read a simple xlsx file ( no pictures , charts or anything complex). Looks like what Axel has suggested is the way ahead for me. Thanks for your inputs. – Vivek Aditya Sep 11 '18 at 10:41
  • @khmarbaise: As of my experience, if "there are no leaders cause the community is the lead " then nobody leads and the project will fail. There must be main contributors in the community who takes leadership. Most times this is exacly how it works, if it works. Else it will **not** work. And only those main contributors can make such main decisions. It's the same thing as with "public property". If "the public" owns something, then nobody owns it. And what nobody owns is worth nothing. That's how humans are. – Axel Richter Sep 11 '18 at 13:16

1 Answers1

1

Thanks to Axel Richter's great answer, I modified it into a Utility class. This so far is working very well for me and could help anyone who is looking just to read a simple XLSX file (without any complicated stuff in it).

import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;


public class XLSXUtil {

    public static void main(String args[]) throws Exception {
        fetchDataFromExcel();
    }

    public static List<List<String>> fetchDataFromExcel() throws Exception {
        List<List<String>> result = new ArrayList<>();
        XMLEventReader reader = null;
        XMLEvent event = null;
        Attribute attribute = null;
        StartElement startElement = null;
        EndElement endElement = null;
        String characters = null;

        StringBuilder stringValue = new StringBuilder(); //for collecting the characters to complete values
        List<String> sharedStrings = new ArrayList<String>(); //list of shared strings
        Map<String, String> numberFormats = new HashMap<String, String>(); //map of number formats
        List<String> cellNumberFormats = new ArrayList<String>(); //list of cell number formats
        Path source = Paths.get("Sample.xlsx"); //path to the Excel file
        FileSystem fs = FileSystems.newFileSystem(source, null); //get filesystem of Excel file

        //get shared strings ==============================================================================
        Path sharedStringsTable = fs.getPath("/xl/sharedStrings.xml");
        reader = XMLInputFactory.newInstance().createXMLEventReader(Files.newInputStream(sharedStringsTable));
        boolean siFound = false;
        while (reader.hasNext()) {
            event = (XMLEvent) reader.next();
            if (event.isStartElement()) {
                startElement = (StartElement) event;
                if (startElement.getName().getLocalPart().equalsIgnoreCase("si")) {
                    //start element of shared string item
                    siFound = true;
                    stringValue = new StringBuilder();
                }
            } else if (event.isCharacters() && siFound) {
                //chars of the shared string item
                characters = event.asCharacters().getData();
                stringValue.append(characters);
            } else if (event.isEndElement()) {
                endElement = (EndElement) event;
                if (endElement.getName().getLocalPart().equalsIgnoreCase("si")) {
                    //end element of shared string item
                    siFound = false;
                    sharedStrings.add(stringValue.toString());
                }
            }
        }
        reader.close();
        //shared strings ==================================================================================

        //get styles, number formats are essential for detecting date / time values =======================
        Path styles = fs.getPath("/xl/styles.xml");
        reader = XMLInputFactory.newInstance().createXMLEventReader(Files.newInputStream(styles));
        boolean cellXfsFound = false;
        while (reader.hasNext()) {
            event = (XMLEvent) reader.next();
            if (event.isStartElement()) {
                startElement = (StartElement) event;
                if (startElement.getName().getLocalPart().equalsIgnoreCase("numFmt")) {
                    //start element of number format
                    attribute = startElement.getAttributeByName(new QName("numFmtId"));
                    String numFmtId = attribute.getValue();
                    attribute = startElement.getAttributeByName(new QName("formatCode"));
                    numberFormats.put(numFmtId, ((attribute != null) ? attribute.getValue() : "null"));
                } else if (startElement.getName().getLocalPart().equalsIgnoreCase("cellXfs")) {
                    //start element of cell format setting
                    cellXfsFound = true;
                } else if (startElement.getName().getLocalPart().equalsIgnoreCase("xf") && cellXfsFound) {
                    //start element of format setting in cell format setting
                    attribute = startElement.getAttributeByName(new QName("numFmtId"));
                    cellNumberFormats.add(((attribute != null) ? attribute.getValue() : "null"));
                }
            } else if (event.isEndElement()) {
                endElement = (EndElement) event;
                if (endElement.getName().getLocalPart().equalsIgnoreCase("cellXfs")) {
                    //end element of cell format setting
                    cellXfsFound = false;
                }
            }
        }
        reader.close();
        //styles ==========================================================================================

        //get sheet data of first sheet ===================================================================
        Path sheet1 = fs.getPath("/xl/worksheets/sheet1.xml");
        reader = XMLInputFactory.newInstance().createXMLEventReader(Files.newInputStream(sheet1));
        boolean rowFound = false;
        boolean cellFound = false;
        boolean cellValueFound = false;
        boolean inlineStringFound = false;
        String cellStyle = null;
        String cellType = null;
        List<String> row = null;
        while (reader.hasNext()) {
            event = (XMLEvent) reader.next();
            if (event.isStartElement()) {
                startElement = (StartElement) event;
                if (startElement.getName().getLocalPart().equalsIgnoreCase("row")) {
                    //start element of row
                    rowFound = true;
                    if (row != null) {
                        result.add(row);
                    }
                    row = new ArrayList<>();
                } else if (startElement.getName().getLocalPart().equalsIgnoreCase("c") && rowFound) {
                    //start element of cell in row
                    cellFound = true;
                    attribute = startElement.getAttributeByName(new QName("t"));
                    cellType = ((attribute != null) ? attribute.getValue() : null);
                    attribute = startElement.getAttributeByName(new QName("s"));
                    cellStyle = ((attribute != null) ? attribute.getValue() : null);
                } else if (startElement.getName().getLocalPart().equalsIgnoreCase("v") && cellFound) {
                    //start element of value in cell
                    cellValueFound = true;
                    stringValue = new StringBuilder();
                } else if (startElement.getName().getLocalPart().equalsIgnoreCase("is") && cellFound) {
                    //start element of inline string in cell
                    inlineStringFound = true;
                    stringValue = new StringBuilder();
                }
            } else if (event.isCharacters() && cellFound && (cellValueFound || inlineStringFound)) {
                //chars of the cell value or the inline string
                characters = event.asCharacters().getData();
                stringValue.append(characters);
            } else if (event.isEndElement()) {
                endElement = (EndElement) event;
                if (endElement.getName().getLocalPart().equalsIgnoreCase("row")) {
                    //end element of row
                    rowFound = false;
                } else if (endElement.getName().getLocalPart().equalsIgnoreCase("c")) {
                    //end element of cell
                    cellFound = false;
                } else if (endElement.getName().getLocalPart().equalsIgnoreCase("v")) {
                    //end element of value
                    cellValueFound = false;
                    String cellValue = stringValue.toString();
                    if ("s".equals(cellType)) {
                        cellValue = sharedStrings.get(Integer.valueOf(cellValue));
                    }
                    row.add(cellValue);
                } else if (endElement.getName().getLocalPart().equalsIgnoreCase("is")) {
                    //end element of inline string
                    inlineStringFound = false;
                    String cellValue = stringValue.toString();
                    row.add(cellValue);
                }
            }
        }
        reader.close();
        //sheet data ======================================================================================
        fs.close();
        return result;
    }
}
Vivek Aditya
  • 1,145
  • 17
  • 46