
I am trying to read a file (tab- or comma-separated) in Java with roughly 3 million rows; I have also increased the virtual machine memory to -Xmx6g. The code works fine with 400K rows for the tab-separated file and slightly fewer for the CSV file. There are many LinkedHashMaps and Vectors involved, and I call System.gc() after every few hundred rows in order to free memory. However, my code gives the following error after 400K rows.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

at java.util.Vector.<init>(Vector.java:111)
at java.util.Vector.<init>(Vector.java:124)
at java.util.Vector.<init>(Vector.java:133)
at cleaning.Capture.main(Capture.java:110)
Ramin
    `System.gc()` calls are wasted effort. You may freely remove them. – Marko Topolnik Nov 06 '13 at 19:35
    Is it time to use a database? – Hovercraft Full Of Eels Nov 06 '13 at 19:35
    You may want to rethink your approach for processing this amount of data; don't try to load everything into memory. You could try to process it chunk-wise (down to line by line). What you have implemented seems to be anything but scalable. – A4L Nov 06 '13 at 19:40
    http://stackoverflow.com/questions/14037404/java-read-large-text-file-with-70million-line-of-text http://stackoverflow.com/questions/2356137/read-large-files-in-java – Mahdi Esmaeili Nov 06 '13 at 19:50

1 Answer


Your attempt to load the whole file is fundamentally ill-fated. You may optimize all you want, but you'll just be pushing the upper limit slightly higher. What you need is to eradicate the limit itself.

There is a negligible chance that you actually need the whole contents in memory all at once. You probably need to calculate something from that data, so you should work out a way to perform that calculation chunk by chunk, each time throwing away the processed chunk, as in the sketch below.
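
For example, here is a minimal sketch of that idea; the file name, column index, and the running aggregate are hypothetical placeholders for whatever your actual calculation needs, since the original code isn't shown:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class StreamingCapture {
        public static void main(String[] args) throws IOException {
            long rowCount = 0;
            double runningTotal = 0;  // hypothetical aggregate; replace with your real computation

            // try-with-resources streams the file, holding only one line in memory at a time
            try (BufferedReader reader = Files.newBufferedReader(Paths.get("data.tsv"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split("\t");          // use "," for the CSV variant
                    runningTotal += Double.parseDouble(fields[2]); // example: sum one column
                    rowCount++;
                    // nothing is retained per row, so 3M rows fit in a small heap
                }
            }
            System.out.println("rows=" + rowCount + " total=" + runningTotal);
        }
    }

Because each line is discarded as soon as it is processed, the heap usage stays flat regardless of how many rows the file contains.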

If your data is deeply intertwined, preventing you from serializing your calculation, then the reasonable recourse is, as Hovercraft Full Of Eels mentions above, to transfer the data into a database and work from there, indexing everything you need, normalizing it, etc.
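
As a rough illustration of that route, here is a sketch that streams the file into a database with JDBC batch inserts; the H2 connection URL, table name, and column layout are assumptions, and the target table is presumed to exist already:

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class LoadIntoDb {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:./capture");
                 BufferedReader reader = Files.newBufferedReader(Paths.get("data.tsv"))) {

                conn.setAutoCommit(false);
                try (PreparedStatement insert = conn.prepareStatement(
                         "INSERT INTO captured_rows(col_a, col_b) VALUES (?, ?)")) {
                    String line;
                    int batched = 0;
                    while ((line = reader.readLine()) != null) {
                        String[] fields = line.split("\t");
                        insert.setString(1, fields[0]);
                        insert.setString(2, fields[1]);
                        insert.addBatch();
                        if (++batched % 10_000 == 0) { // flush in batches to keep memory flat
                            insert.executeBatch();
                            conn.commit();
                        }
                    }
                    insert.executeBatch();
                    conn.commit();
                }
            }
        }
    }

Once the rows are in the database, you can index the columns your queries join or filter on and let the database do the heavy lifting instead of the Java heap.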

Marko Topolnik