I had the same problem with 800,000 cells and 3M characters where XSSF allocates 1GB of heap!
I used Python with openpyxl
and numpy
to read the xlsx file (from Java code) and first convert it to a normal text. Then I loaded the text file in java. It may seems to have large overhead, but it is indeed fast.
The python script looks like
import openpyxl as px
import numpy as np
# xlsx file is given through command line foo.xlsx
fname = sys.argv[1]
W = px.load_workbook(fname, read_only = True)
p = W.get_sheet_by_name(name = 'Sheet1')
a=[]
# number of rows and columns
m = p.max_row
n = p.max_column
for row in p.iter_rows():
for k in row:
a.append(k.value)
# convert list a to matrix (for example maxRows*maxColumns)
aa= np.resize(a, [m, n])
# output file is also given in the command line foo.txt
oname = sys.argv[2]
print (oname)
file = open(oname,"w")
mm = m-1
for i in range(mm):
for j in range(n):
file.write( "%s " %aa[i,j] )
file.write ("\n")
# to prevent extra newline in the text file
for j in range(n):
file.write("%s " %aa[m-1,j])
file.close()
Then in my java code, I wrote
try {
// `pwd`\python_script foo.xlsx foo.txt
String pythonScript = System.getProperty("user.dir") + "\\exread.py ";
String cmdline = "python " + pythonScript +
workingDirectoryPath + "\\" + fullFileName + " " +
workingDirectoryPath + "\\" + shortFileName + ".txt";
Process p = Runtime.getRuntime().exec(cmdline);
int exitCode = p.waitFor();
if (exitCode != 0) {
throw new IOException("Python command exited with " + exitCode);
}
} catch (IOException e) {
System.out.println( e.getMessage() );
} catch (InterruptedException e) {
ReadInfo.append(e.getMessage() );
}
After that, you will get foo.txt which is similar to foo.xlsx, but in text format.