Processing large xlsx file

Question

I need to auto-fit all rows in a large (30k+ rows) xlsx file.

The following code, which uses Apache POI, works on small files but fails with an OutOfMemoryError on large ones:

Workbook workbook = WorkbookFactory.create(inputStream);
Sheet sheet = workbook.getSheetAt(0);

for (Row row : sheet) {
    row.setHeight((short) -1);
}

workbook.write(outputStream);

Update: Unfortunately, increasing the heap size is not an option: the OutOfMemoryError appears even at -Xmx1024m, and 30k rows is not an upper limit.

Where are you running this code? Inside an app/web server or standalone? – JSS, Feb 04 '11 at 12:28

score 37 · Accepted Answer · edited Feb 27 '15 at 13:48

Try using the event API. See Event API (HSSF only) and XSSF and SAX (Event API) in the POI documentation for details. A couple of quotes from that page:

HSSF:

The event API is newer than the User API. It is intended for intermediate developers who are willing to learn a little bit of the low level API structures. Its relatively simple to use, but requires a basic understanding of the parts of an Excel file (or willingness to learn). The advantage provided is that you can read an XLS with a relatively small memory footprint.

XSSF:

If memory footprint is an issue, then for XSSF, you can get at the underlying XML data, and process it yourself. This is intended for intermediate developers who are willing to learn a little bit of low level structure of .xlsx files, and who are happy processing XML in java. Its relatively simple to use, but requires a basic understanding of the file structure. The advantage provided is that you can read a XLSX file with a relatively small memory footprint.
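As a rough illustration of "get at the underlying XML data", here is a minimal sketch that uses only the JDK rather than POI. An .xlsx file is a zip archive, so a single sheet part can be streamed through a SAX parser without loading the workbook. The class name `SheetRowCounter` and the part name `xl/worksheets/sheet1.xml` are assumptions for illustration; the real part name is defined by the workbook's relationships.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SheetRowCounter {

    /** Counts <row> elements in one sheet part without building a DOM. */
    public static int countRows(InputStream sheetXml)
            throws IOException, SAXException, ParserConfigurationException {
        final int[] rows = {0};
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        factory.newSAXParser().parse(sheetXml, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("row".equals(localName)) {
                    rows[0]++;
                }
            }
        });
        return rows[0];
    }

    /** Opens the .xlsx as a zip and streams a single sheet part. */
    public static int countRows(File xlsx, String sheetEntry)
            throws IOException, SAXException, ParserConfigurationException {
        try (ZipFile zip = new ZipFile(xlsx)) {
            // Assumed part name, e.g. "xl/worksheets/sheet1.xml".
            ZipEntry entry = zip.getEntry(sheetEntry);
            if (entry == null) {
                throw new IOException("No such part: " + sheetEntry);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return countRows(in);
            }
        }
    }
}
```

Memory use stays roughly constant because the parser only ever holds the current element, not the whole sheet.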

For output, one possible approach is described in the blog post Streaming xlsx files. (Basically, use XSSF to generate a container XML file, then stream the actual content as plain text into the appropriate xml part of the xlsx zip archive.)
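The output idea can be sketched with the JDK's zip classes alone: copy every part of a template .xlsx unchanged and stream new text into the sheet part. This is only an illustration of the approach, not the blog post's code; `SheetSubstitution`, `SheetWriter`, and the part name passed in are hypothetical, and the template is assumed to have been generated beforehand (for example with XSSF).

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class SheetSubstitution {

    /** Hypothetical callback that writes the sheet XML as plain text. */
    public interface SheetWriter {
        void write(Writer out) throws IOException;
    }

    /** Copies the template archive, replacing one part with streamed content. */
    public static void substitute(File template, String sheetEntry,
                                  OutputStream target, SheetWriter sheetWriter)
            throws IOException {
        try (ZipFile zip = new ZipFile(template);
             ZipOutputStream zos = new ZipOutputStream(target)) {
            byte[] buffer = new byte[8192];
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.getName().equals(sheetEntry)) {
                    continue; // this part is written from scratch below
                }
                zos.putNextEntry(new ZipEntry(entry.getName()));
                try (InputStream in = zip.getInputStream(entry)) {
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        zos.write(buffer, 0, n);
                    }
                }
                zos.closeEntry();
            }
            // Stream the sheet XML straight into the archive.
            zos.putNextEntry(new ZipEntry(sheetEntry));
            Writer writer = new OutputStreamWriter(zos, StandardCharsets.UTF_8);
            sheetWriter.write(writer);
            writer.flush(); // do not close: that would close the zip stream too
            zos.closeEntry();
        }
    }
}
```

The callback would emit the `<worksheet>` XML row by row, so only one row's text needs to be in memory at a time.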

Hi, I am also having the same problem reading large Excel files: I get out-of-memory errors. I have looked at http://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api, but it does not explain how to read the Excel files. Please help. – ajm, Feb 20 '12 at 13:38
@Ashish: Please post your request as a separate question on Stack Overflow with more details. That way, other users can help you as well. — markusk, Mar 06 '12 at 13:59
For reading large Excel files, you can take a look at this tiny and simple library: https://github.com/davidpelfree/sjxlsx – David Peleg, Jan 19 '15 at 09:02

rjdkolb · Answer 2 · 2013-07-09T07:36:39.240

13

Memory usage can be improved dramatically by using a File instead of a Stream. (A streaming API is better still, but the streaming APIs have limitations; see http://poi.apache.org/spreadsheet/index.html)

So instead of

Workbook workbook = WorkbookFactory.create(inputStream);

do

Workbook workbook = WorkbookFactory.create(new File("yourfile.xlsx"));

This is according to : http://poi.apache.org/spreadsheet/quick-guide.html#FileInputStream

Files vs InputStreams

"When opening a workbook, either a .xls HSSFWorkbook, or a .xlsx XSSFWorkbook, the Workbook can be loaded from either a File or an InputStream. Using a File object allows for lower memory consumption, while an InputStream requires more memory as it has to buffer the whole file."
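If the workbook only arrives as an InputStream, as in the question, one way to still get the File-based behaviour is to spool the stream to a temporary file first. A minimal JDK-only sketch; the class and method names are hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SpoolToFile {

    /** Copies the stream to a temp file so it can be opened as a File. */
    public static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("workbook-", ".xlsx");
        // Files.copy streams through a small buffer; the whole file
        // is never held in memory at once.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }
}
```

The result can then be passed along as `WorkbookFactory.create(tmp.toFile())`, and the temp file deleted once the workbook has been closed.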

edited Jul 09 '13 at 07:36

answered Jul 09 '13 at 07:29


This gives me an error stating: `Caught: java.lang.LinkageError: loader constraint violation: when resolving interface method "org.xml.sax.XMLReader.setEntityResolver(Lorg/xml/sax/EntityResolver;)V" the class loader (instance of org/codehaus/groovy/tools/RootLoader) of the current class, org/dom4j/io/SAXReader, and the class loader (instance of ) for the method's defining class, org/xml/sax/XMLReader, have different Class objects for the type org/xml/sax/EntityResolver used in the signature` I am using `poi-3.9` – kiltek Jun 16 '16 at 08:48

@rjdkolb can you see my post https://stackoverflow.com/questions/48772021/how-to-solve-the-java-heap-space-error-while-loading-the-large-xls-file-using-po – Mandrek Feb 13 '18 at 17:13
Nothing improves when using a file upwards of 15 MB. I've set -Xmx2048m and it still throws out-of-memory errors. – saran3h Mar 22 '21 at 14:43

score 3 · Answer 3 · answered May 23 '13 at 04:34

I was having the same problem with far fewer rows, but large strings.

Since I don't have to keep my data loaded, I found out that I can use SXSSF instead of XSSF.

They have similar interfaces, which helps if you have a lot of code already written. But with SXSSF you can set the number of rows kept in memory.

Here is the link. http://poi.apache.org/spreadsheet/how-to.html#sxssf

score 2 · Answer 4 · answered Feb 28 '15 at 06:54

If you want to auto-fit, set styles, or write all rows in a large (30k+ rows) xlsx file, use SXSSFWorkbook. Here is some sample code that may help:

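import java.io.File;
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Font;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.streaming.SXSSFSheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.apache.poi.xssf.usermodel.XSSFCellStyle;
import org.apache.poi.xssf.usermodel.XSSFColor;

SXSSFWorkbook wb = new SXSSFWorkbook();
SXSSFSheet sheet = (SXSSFSheet) wb.createSheet("writetoexcel");

// Written against the POI 3.x API; newer releases use font.setBold(true)
// and FillPatternType.SOLID_FOREGROUND instead.
Font font = wb.createFont();
font.setBoldweight((short) 700);

// Create the style for the sheet.
XSSFCellStyle style = (XSSFCellStyle) wb.createCellStyle();
style.setFillForegroundColor(new XSSFColor(java.awt.Color.LIGHT_GRAY));
style.setFillPattern(XSSFCellStyle.SOLID_FOREGROUND);
style.setFont(font);

// Iterate over r rows.
for (int r = 0; r < 30000; r++) {
    Row row = sheet.createRow(r);
    // Iterate over c columns.
    for (int c = 0; c < 75; c++) {
        Cell cell = row.createCell(c);
        cell.setCellValue("Hello");
        cell.setCellStyle(style);
    }
}

// Write the workbook to disk, then delete the temporary files SXSSF keeps.
FileOutputStream fileOut = new FileOutputStream("E:" + File.separator + "NewTest.xlsx");
wb.write(fileOut);
fileOut.close();
wb.dispose();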

score 1 · Answer 5 · answered Mar 17 '11 at 01:28


I used the Event API for an HSSF file (.xls), and I discovered a terrible lack of documentation about the order of records.


ArturoTena


I know this is old, but did you find anything about the order of the events in HSSF and/or XSSF? – cripox Mar 17 '16 at 14:52

score 0 · Answer 6 · answered Aug 04 '13 at 18:00


If you are writing to XLSX, I found an improvement by writing to different sheets of the same Excel file. You also might find an improvement by writing to different Excel files. But first try writing to different sheets.


Alexander Mills


score 0 · Answer 7 · edited May 23 '17 at 10:31

The best example for this is described in the following Stack Overflow thread: Error While Reading Large Excel Files (xlsx) Via Apache POI

The code snippet in the main answer in that thread illustrates the Apache POI wrappers around SAX XML parsing, and how you can loop over all the sheets and then over each individual cell.

That code is slightly stale relative to the current Apache POI API, in which endRow() receives the number of the row that has just finished processing.

With that code snippet it should be straightforward to parse a big XLSX file cell by cell: for each sheet, an event for each cell in a row, then a row-has-ended event. You could easily build application logic where, at the end of each row, you create a Map of columnName to cellValue.
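To make the "Map per row" idea concrete without depending on a particular POI version, here is a rough sketch that uses plain JDK SAX on the sheet XML instead of POI's handler classes. `RowMapReader` is a hypothetical name. The sketch assumes every `<row>` and `<c>` element carries an `r` attribute, keys the map by column letter, and reports shared-string cells as their raw index (resolving them would need a lookup in the shared strings part).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class RowMapReader extends DefaultHandler {
    private final BiConsumer<Integer, Map<String, String>> onRow;
    private final Map<String, String> current = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private int rowNum;
    private String column;
    private boolean inValue;

    public RowMapReader(BiConsumer<Integer, Map<String, String>> onRow) {
        this.onRow = onRow;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        switch (localName) {
            case "row":
                current.clear();
                rowNum = Integer.parseInt(atts.getValue("r")); // assumes r is present
                break;
            case "c":
                // "B12" -> "B": strip the row digits from the cell reference
                column = atts.getValue("r").replaceAll("[0-9]", "");
                text.setLength(0);
                break;
            case "v": // stored value
            case "t": // inline string text
                inValue = true;
                break;
            default:
                break;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inValue) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        switch (localName) {
            case "v":
            case "t":
                inValue = false;
                break;
            case "c":
                current.put(column, text.toString());
                break;
            case "row":
                // Row finished: hand a copy of the map to the caller.
                onRow.accept(rowNum, new LinkedHashMap<>(current));
                break;
            default:
                break;
        }
    }

    /** Streams one sheet part and calls onRow once per finished row. */
    public static void parse(InputStream sheetXml, BiConsumer<Integer, Map<String, String>> onRow)
            throws IOException, SAXException, ParserConfigurationException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(sheetXml, new RowMapReader(onRow));
    }
}
```

Only one row's cells are held at a time, so memory use does not grow with the number of rows.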

mahmood · Answer 8 · 2017-05-06T05:10:32.320

I had the same problem with 800,000 cells and 3M characters where XSSF allocates 1GB of heap!

I used Python with openpyxl and numpy to read the xlsx file (called from Java code) and first convert it to plain text. Then I loaded the text file in Java. It may seem to have a large overhead, but it is in fact fast.

The Python script looks like this:

import sys

import numpy as np
import openpyxl as px

# The xlsx file is given on the command line, e.g. foo.xlsx
fname = sys.argv[1]
W = px.load_workbook(fname, read_only=True)
p = W['Sheet1']

a = []
# number of rows and columns
m = p.max_row
n = p.max_column

for row in p.iter_rows():
    for k in row:
        a.append(k.value)

# convert list a to a matrix (for example maxRows*maxColumns)
aa = np.resize(a, [m, n])

# The output file is also given on the command line, e.g. foo.txt
oname = sys.argv[2]
print(oname)
with open(oname, "w") as file:
    for i in range(m - 1):
        for j in range(n):
            file.write("%s " % aa[i, j])
        file.write("\n")

    # write the last row without a trailing newline
    for j in range(n):
        file.write("%s " % aa[m - 1, j])

Then, in my Java code, I wrote:

try {
  // `pwd`\python_script  foo.xlsx  foo.txt
  String pythonScript =  System.getProperty("user.dir") + "\\exread.py ";
  String cmdline = "python " + pythonScript +
                    workingDirectoryPath + "\\" + fullFileName + " " + 
                    workingDirectoryPath + "\\" + shortFileName + ".txt";
  Process p = Runtime.getRuntime().exec(cmdline);
  int exitCode = p.waitFor();
  if (exitCode != 0) {
    throw new IOException("Python command exited with " + exitCode);
  }
} catch (IOException e) {
  System.out.println( e.getMessage() );
} catch (InterruptedException e) {
  ReadInfo.append(e.getMessage() );
}

After that, you will get foo.txt which is similar to foo.xlsx, but in text format.
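Loading the text file back into Java can also be done one line at a time, so the whole sheet never has to sit in memory. A minimal sketch, assuming the space-separated format written by the script above (so cell values that contain spaces would be split) and UTF-8 encoding; `TextRowReader` is a hypothetical name:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

public class TextRowReader {

    /** Calls onRow once per non-empty line and returns the number of rows read. */
    public static int forEachRow(Path textFile, Consumer<String[]> onRow) throws IOException {
        int rows = 0;
        try (BufferedReader reader = Files.newBufferedReader(textFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                // The script writes "value " for every cell, so drop the one trailing space.
                String body = line.endsWith(" ") ? line.substring(0, line.length() - 1) : line;
                onRow.accept(body.split(" ", -1));
                rows++;
            }
        }
        return rows;
    }
}
```

A call such as `TextRowReader.forEachRow(Paths.get("foo.txt"), cells -> ...)` would then process each row as it is read.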

score 0 · Answer 9 · answered Jul 07 '17 at 10:57


I used a SAX parser to process the XML structure. It works for XLSX files.

https://stackoverflow.com/a/44969009/4587961


Yan Khonski

