
I want to read a huge CSV file in Java. It contains 75,000,000 lines. The problem is that even though I am using the maximum Xms and Xmx limits, I am getting `java.lang.OutOfMemoryError: GC overhead limit exceeded`, and it shows that this line causes the error:

String[][] matrix = new String[counterRow][counterCol];

I did some tests and saw that I can read 15,000,000 lines fine. Therefore I started to use this sort of code:

String csvFile = "myfile.csv";
List<String[]> rowList = new ArrayList<>();
String line = "";
String cvsSplitBy = ",";
BufferedReader br = null;
try {
    int counterRow = 0, counterCol = 12, id = 0;
    br = new BufferedReader(new FileReader(csvFile));
    while ((line = br.readLine()) != null) { 
        String[] object = line.split(cvsSplitBy);
        rowList.add(object); 
        counterRow++;
        if (counterRow % 15000000 == 0) {
            String[][] matrix = new String[counterRow][counterCol];
            // ... do processes ...
            SaveAsCSV(matrix, id);
            counterRow = 0; id++; rowList.clear();
        }
    }
}
...

Here, it writes the first 15,000,000 lines very well, but on the second pass it gives the same error again, even though counterRow is 15,000,000.

In summary, I need to read a CSV file that contains 75,000,000 rows (approx. 5 GB) in Java and save a new CSV file (or files) after doing some processing on its records.

How can I solve this problem?

Thanks

EDIT: I am also using rowList.clear(), guys; I forgot to include it here. Sorry.

EDIT 2: My friends, I don't need to put the whole file in memory. How can I read it part by part? That is actually what I tried to do with if (counterRow % 15000000 == 0). What is the correct way to do it?

  • That's a huge amount of data to have in memory - why don't you try writing to a database, then querying it? – NoBugs Aug 07 '14 at 14:25
  • You definitely can't bring the whole goddamn file into memory. Can you process the file in batches/parts? – webuster Aug 07 '14 at 14:25
  • Memory mapped files? http://javarevisited.blogspot.de/2012/01/memorymapped-file-and-io-in-java.html – Fildor Aug 07 '14 at 14:26
  • If your file is 5 GB and you want to keep it in memory, you'll need at least 5 GB of RAM I think, huge ^^ – singe3 Aug 07 '14 at 14:27
  • Memory mapped files can also map only portions of files ... – Fildor Aug 07 '14 at 14:27
  • `streaming` is your best friend here – injecteer Aug 07 '14 at 14:27
  • Do you really need to put the whole file in memory? Do you clear rowList in the if (counterRow % 15000000 == 0) block somewhere in "do processes"? – StephaneM Aug 07 '14 at 14:28
  • [http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly](http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly) – Victor Aug 07 '14 at 14:29
  • also, I don't see any `rowList.clear()` in your code, meaning you are NOT batching – injecteer Aug 07 '14 at 14:29
  • Does every row stand for itself, or does your "... do processes ..." part of the code for example filter out duplicates or sort things? If the rows are not connected, you don't need the rowList - just process line by line. If the rows are connected, you can use file-based sorting or, more simply, feed them to a database. – Thomas Köhne Aug 07 '14 at 14:46
  • I have 32 GB RAM and I am using rowList.clear() –  Aug 07 '14 at 14:56
  • The issue is not that you do not have enough memory; "GC overhead limit exceeded" means that the garbage collection is taking too long. – GeertPt Aug 07 '14 at 15:46

4 Answers

4

You can read the lines individually and do your processing as you go, until you have read the entire file:

String encoding = "UTF-8";
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("/bigfile.txt"), encoding));
String line;
while ((line = br.readLine()) != null) {
   // process the line.
}
br.close();

This should not go fubar as long as you process each line immediately and don't store it in variables outside your loop.
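
For example, here is a minimal sketch of that approach, assuming Java 7+ and writing each processed row straight to an output file so nothing accumulates on the heap; the file names and the per-row transformation are placeholders:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StreamCsv {
    public static void main(String[] args) throws IOException {
        // try-with-resources closes both files even if processing fails
        try (BufferedReader br = Files.newBufferedReader(Paths.get("myfile.csv"), StandardCharsets.UTF_8);
             BufferedWriter bw = Files.newBufferedWriter(Paths.get("out.csv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] cols = line.split(",");
                // ... transform cols here, one row at a time ...
                bw.write(String.join(",", cols));
                bw.newLine();
            }
        }
    }
}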

Arno_Geismar
1

The issue is not that you do not have enough memory; the "GC overhead limit exceeded" error means that garbage collection is taking too long. You cannot fix this by allocating more memory, only by using -XX:-UseGCOverheadLimit (that is, if you really want that much data in memory).

See e.g. How to solve "GC overhead limit exceeded" using maven jvmArg?

Or use Peter Lawrey's memory-mapped HugeCollections, which write to disk when memory is full: http://vanillajava.blogspot.be/2011/08/added-memory-mapped-support-to.html?q=huge+collections
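
For reference, the flag goes on the JVM command line when you start the application; a sketch, where the heap size and main class are only placeholders for your own setup:

java -Xmx28g -XX:-UseGCOverheadLimit com.example.MyCsvJob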

GeertPt
  • Ah nice point. I am using rowList.clear() also, forgot to copy/paste here! –  Aug 07 '14 at 14:54
0

Maybe you forgot to call

rowList.clear();

after

counterRow=0; id++;
  • Ah nice point. I am using rowList.clear() also, forgot to copy/paste here! –  Aug 07 '14 at 14:52
0

The “java.lang.OutOfMemoryError: GC overhead limit exceeded” error will be displayed when your application has exhausted pretty much all the available memory and GC has repeatedly failed to clean it.

The solution recommended above, specifying -XX:-UseGCOverheadLimit, is something I strongly suggest not doing. Instead of fixing the problem you are just postponing the inevitable: the application is running out of memory and needs to be fixed. Specifying this option merely masks the original “java.lang.OutOfMemoryError: GC overhead limit exceeded” error with the more familiar message “java.lang.OutOfMemoryError: Java heap space”.

Possible solutions pretty much boil down to two reasonable alternatives in your case - either increase heap space (-Xmx parameter) or reduce the heap consumption of your code by reading the file in smaller batches.
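
For the batching route, here is a minimal sketch, assuming each chunk can be processed independently; the batch size, file name and processAndSave helper are placeholders standing in for the asker's own "do processes" step and SaveAsCSV:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class BatchCsv {
    // Batch size chosen so one batch (plus its garbage) fits comfortably in the heap
    private static final int BATCH_SIZE = 1_000_000;

    public static void main(String[] args) throws IOException {
        List<String[]> batch = new ArrayList<>(BATCH_SIZE);
        int id = 0;
        try (BufferedReader br = Files.newBufferedReader(Paths.get("myfile.csv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                batch.add(line.split(","));
                if (batch.size() == BATCH_SIZE) {
                    processAndSave(batch, id++); // stand-in for ".. do processes .." + SaveAsCSV(matrix, id)
                    batch.clear();               // drop references so the old rows can be collected
                }
            }
            if (!batch.isEmpty()) {
                processAndSave(batch, id);       // flush the final, partial batch
            }
        }
    }

    private static void processAndSave(List<String[]> rows, int id) throws IOException {
        // hypothetical placeholder: process the rows and write them to e.g. "out_" + id + ".csv"
    }
}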

Flexo
Ivo