
I have a large spreadsheet. It has 10 sheets, each with 1 million rows. Using Java, I need to run an algorithm on each row, get a value back for each row, and insert the results back into the Excel file.

My idea was to load the file into RAM, do the calculations for each row, store the results in a list, and write them back to Excel in order, but I didn't anticipate the issues that come with data of this size.

I tried XSSF, and it wasn't able to load such a large file: after waiting a few hours it threw an OutOfMemoryError.

I tried increasing the heap, both in Run -> Run Configurations -> Arguments and in Control Panel -> Java. It didn't help.

I also tried the following StreamingReader, and it didn't work either:

import java.io.FileInputStream;
import org.apache.poi.ss.usermodel.Workbook;
import com.monitorjbl.xlsx.StreamingReader;

FileInputStream in = new FileInputStream("D:\\work\\calculatepi\\sampleresult.xlsx");
Workbook workbook = StreamingReader.builder()
    .rowCacheSize(100)   // number of rows to keep in memory at any time
    .bufferSize(4096)    // size of the read buffer, in bytes
    .open(in);

I'm really out of ideas and not sure what to do. Is there no easy way to do this?

  • Please provide a [mcve] which shows the error, as well as the output of any memory and CPU profiling that you have done. – Joe C Dec 09 '18 at 08:15
  • "Is there no easy way to do this?" Yes there is no easy way because of the complex structure of `*.xlsx` files. But `StreamingReader` is the correct approach for reading. How exactly it didn't work? – Axel Richter Dec 09 '18 at 08:16
  • 2
    If you have 10 x 1 million rows then you need a database – Joakim Danielson Dec 09 '18 at 08:16
  • @JoakimDanielson I was given this large excel file and not a database. It's the only thing I have to work with. – Andy Dec 09 '18 at 10:42
  • I appreciate the quick comeback. Is there anything I can add to the question to also make it upvote worthy? – GhostCat Dec 24 '18 at 05:49

1 Answer


It is not only about the configuration of that library. It is also about the memory that you give to your JVM! Try increasing the heap space of the JVM, see here for example.
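For example, if you start the program from the command line, something like the following raises the maximum heap to 8 GB (the JAR name here is made up; when running from an IDE, put the flag into the VM arguments of the run configuration instead):

java -Xmx8g -jar calculatepi.jar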

Beyond that, I think you should do a few things:

  • make experiments with smaller sheets: create one that has only 100 rows, then maybe 10K, then 100K, and measure the memory consumption as you go
  • see if there are other APIs/libraries that allow you to read/write individual rows without pulling the whole file into memory (see the sketch after this list)
  • and if none of that works, maybe you have to use a completely different design, such as having some sort of "service": you write some VB script code that runs inside Excel and, for each row, simply calls that service to fetch the result. Or, ideally: do not misuse Excel as a database. That is like using a sports car to transport a huge amount of goods just because you already happen to have that sports car; it would still be more appropriate to get yourself a truck. In other words: consider moving your data into a real database. In the long run, everything you do will be "easier" then!
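As a rough sketch of the streaming route from the second bullet: the StreamingReader from your question can read rows one at a time, and POI's SXSSFWorkbook can write rows one at a time, so you never hold a whole sheet in memory. The output file name and the computeResult method below are placeholders for your own algorithm, and note that this writes the results into a new file rather than back into the original one, since the streaming APIs cannot modify a file in place:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

import com.monitorjbl.xlsx.StreamingReader;

public class StreamingRecalc {

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("D:\\work\\calculatepi\\sampleresult.xlsx");
             Workbook input = StreamingReader.builder()
                     .rowCacheSize(100)   // rows kept in memory per sheet while reading
                     .bufferSize(4096)    // read buffer size, in bytes
                     .open(in)) {

            // SXSSF flushes rows to temporary files once 100 of them are in memory
            SXSSFWorkbook output = new SXSSFWorkbook(100);
            try (OutputStream out = new FileOutputStream("D:\\work\\calculatepi\\results.xlsx")) {
                for (Sheet sheet : input) {
                    Sheet resultSheet = output.createSheet(sheet.getSheetName());
                    int rowIndex = 0;
                    for (Row row : sheet) {
                        double result = computeResult(row);   // your per-row algorithm
                        resultSheet.createRow(rowIndex++)
                                   .createCell(0)
                                   .setCellValue(result);
                    }
                }
                output.write(out);
            } finally {
                output.dispose();   // delete the temporary files SXSSF flushed rows to
            }
        }
    }

    // placeholder for the per-row calculation
    private static double computeResult(Row row) {
        return row.getCell(0) == null ? 0 : row.getCell(0).getNumericCellValue();
    }
}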
  • Hey thanks, I ended up solving this by breaking this huge excel spreadsheet into multiple csv files and just working with IO streams. I copy pasted the result csv files back into the excel manually. It was hideous work, but it worked. I appreciate your suggestions. – Andy Dec 24 '18 at 02:56
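For anyone taking the same CSV route, here is a minimal sketch of that kind of line-by-line processing. It assumes one exported CSV per sheet, uses a naive comma split (quoted fields would need a real CSV parser), and the file names and computeResult method are placeholders:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CsvRecalc {

    public static void main(String[] args) throws IOException {
        Path input = Paths.get("D:\\work\\calculatepi\\sheet1.csv");          // one exported sheet
        Path output = Paths.get("D:\\work\\calculatepi\\sheet1_result.csv");  // same rows plus result column

        try (BufferedReader reader = Files.newBufferedReader(input);
             BufferedWriter writer = Files.newBufferedWriter(output)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cells = line.split(",");       // naive split, no quoted-field handling
                double result = computeResult(cells);   // your per-row algorithm
                writer.write(line + "," + result);      // append the result as an extra column
                writer.newLine();
            }
        }
    }

    // placeholder for the per-row calculation
    private static double computeResult(String[] cells) {
        return (cells.length == 0 || cells[0].isEmpty()) ? 0 : Double.parseDouble(cells[0]);
    }
}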