A *.xlsx file is a ZIP archive that stores the Excel data in a directory structure of several XML files.
For example, there are /xl/workbook.xml, which describes the basic workbook structure; /xl/worksheets/sheet1.xml, /xl/worksheets/sheet2.xml, ... /xl/worksheets/sheetN.xml, which hold the sheet data (the rows and the cells, although not all cell data is stored there directly, and neither are the cell styles); /xl/styles.xml, which contains the cell styles; and /xl/sharedStrings.xml, which contains all string content of the cells in all sheets. The shared strings avoid storing the same string many times when it is used in multiple cells.
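To see this structure for a concrete file, the entries of the archive can be listed with the JDK's own ZIP classes. The following is only a minimal sketch: the class name XlsxEntryLister and the method listEntries are made up for illustration and are not part of apache poi. Note that the entry names inside the ZIP have no leading slash (xl/workbook.xml rather than /xl/workbook.xml).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration only; not part of apache poi.
class XlsxEntryLister {

    // Returns the names of all entries in the ZIP stream, in archive order.
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }
}
```

Called with something like Files.newInputStream(Paths.get("workbook.xlsx")), where the file name is a placeholder, this should show entries such as xl/workbook.xml, xl/styles.xml, xl/sharedStrings.xml and one xl/worksheets/sheetN.xml per sheet, among others.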
So if you want to read the *.xlsx ZIP archive, you need to unpack the ZIP archive and then parse at least the four kinds of XML files mentioned above to get the data for the XSSFWorkbook. This is what Apache POI does during XSSFWorkbook wb = new XSSFWorkbook(fileinputstream);.
So if you really need an XSSFWorkbook as the result, there is no way around this process. And unless you suspect that Apache POI has programmed explicit delay routines, there is no possibility of reducing the amount of time this process takes.
Your approach of reading fewer rows than are stored in the sheet could possibly save time. But then your result would be an XSSFWorkbook containing all the styles and all the string contents but only part of the sheet data related to those styles and strings. So it would lead to a partially broken XSSFWorkbook. That's why nobody has really thought about this approach.
Only if the requirement is just to read the plain unformatted data from one of the /xl/worksheets/sheetN.xml files, without creating an XSSFWorkbook, do you need only to unpack the ZIP archive and parse the needed /xl/worksheets/sheetN.xml and the /xl/sharedStrings.xml, from which the string content of the cells is taken. This would be possible in less time than the whole process described above.
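A rough sketch of that lighter approach, using only the JDK (java.util.zip and the StAX parser) and no apache poi at all, could look like the following. The class PlainSheetReader and its methods are hypothetical names chosen for this example. It assumes that the shared strings live at xl/sharedStrings.xml, that the caller already knows which sheetN.xml entry is wanted, and that every cell carries its reference in the r attribute. Styles, number formats and dates are ignored, so numeric cells come back as the raw stored text.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not part of apache poi.
class PlainSheetReader {

    private static XMLStreamReader open(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        return factory.createXMLStreamReader(in);
    }

    // One string per <si> element; rich text runs are concatenated,
    // phonetic runs (<rPh>) are skipped.
    static List<String> readSharedStrings(InputStream in) throws XMLStreamException {
        List<String> strings = new ArrayList<>();
        XMLStreamReader xml = open(in);
        StringBuilder current = null;
        boolean inText = false;
        int phoneticDepth = 0;
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = xml.getLocalName();
                if (name.equals("si")) current = new StringBuilder();
                else if (name.equals("rPh")) phoneticDepth++;
                else if (name.equals("t") && phoneticDepth == 0) inText = true;
            } else if (event == XMLStreamConstants.CHARACTERS && inText && current != null) {
                current.append(xml.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                String name = xml.getLocalName();
                if (name.equals("t")) inText = false;
                else if (name.equals("rPh")) phoneticDepth--;
                else if (name.equals("si") && current != null) {
                    strings.add(current.toString());
                    current = null;
                }
            }
        }
        return strings;
    }

    // Maps cell references (e.g. "A1") to plain text values.
    // Cells without an r attribute are skipped (simplifying assumption).
    static Map<String, String> readSheet(InputStream in, List<String> shared) throws XMLStreamException {
        Map<String, String> cells = new LinkedHashMap<>();
        XMLStreamReader xml = open(in);
        String ref = null;
        String type = null;
        StringBuilder value = new StringBuilder();
        boolean capture = false;
        while (xml.hasNext()) {
            int event = xml.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = xml.getLocalName();
                if (name.equals("c")) {
                    ref = xml.getAttributeValue(null, "r");
                    type = xml.getAttributeValue(null, "t");
                    value = new StringBuilder();
                } else if (ref != null && (name.equals("v") || name.equals("t"))) {
                    capture = true; // <v> holds the value, <t> holds inline string text
                }
            } else if (event == XMLStreamConstants.CHARACTERS && capture) {
                value.append(xml.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                String name = xml.getLocalName();
                if (name.equals("v") || name.equals("t")) capture = false;
                else if (name.equals("c") && ref != null) {
                    String raw = value.toString();
                    // t="s" means <v> is an index into the shared strings
                    boolean isShared = "s".equals(type) && !raw.isEmpty();
                    cells.put(ref, isShared ? shared.get(Integer.parseInt(raw)) : raw);
                    ref = null;
                }
            }
        }
        return cells;
    }

    // sheetEntry is a ZIP entry name such as "xl/worksheets/sheet1.xml" (no leading slash).
    static Map<String, String> readPlainData(Path xlsx, String sheetEntry)
            throws IOException, XMLStreamException {
        try (ZipFile zip = new ZipFile(xlsx.toFile())) {
            List<String> shared = new ArrayList<>();
            ZipEntry sst = zip.getEntry("xl/sharedStrings.xml"); // assumed default location
            if (sst != null) {
                try (InputStream in = zip.getInputStream(sst)) {
                    shared = readSharedStrings(in);
                }
            }
            ZipEntry sheet = zip.getEntry(sheetEntry);
            if (sheet == null) throw new IOException("No such entry: " + sheetEntry);
            try (InputStream in = zip.getInputStream(sheet)) {
                return readSheet(in, shared);
            }
        }
    }
}
```

A call such as PlainSheetReader.readPlainData(Paths.get("workbook.xlsx"), "xl/worksheets/sheet1.xml"), with a placeholder file name, then returns the cell references mapped to their plain values. Because only two entries of the archive are read and both are streamed, neither /xl/styles.xml nor the other sheets are ever parsed.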