
I am trying to read a lot of data (10k-20k records) from files (10 threads running for 10 mins). I get an exception:

Exception in thread "main" Exception in thread "Thread-26" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Unknown Source)
    at java.lang.String.<init>(Unknown Source)
    at java.io.BufferedReader.readLine(Unknown Source)
    at java.io.BufferedReader.readLine(Unknown Source)

I get the above error message for the code snippet below. I have been trying to debug this, and the closest I have come is to use CharSequence, but I still get the heap exception. (At this point, can anyone help me understand why CharSequence would be better? It seems it would load a smaller amount of data into main memory at a time, but eventually all the data needs to be in main memory anyway.)

I am able to run the code for 1 min, but anything near 10 mins blows up. Is there an efficient way to read the files?

**This code is part of a research project and I am still refactoring it, so a lot of inefficient code does exist.**

    try{
        for(int i=0; i<threadCount; i++){
            fstream = new FileInputStream(dir+"//read"+machineid+"-"+i + ".txt");
            // Wrap the stream in a BufferedReader so the file can be read as text, line by line.
            BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
            String line;
            // Read File Line By Line
            String[] tokens;

            while ((line = br.readLine()) != null) {
                tokens = line.split(",");
                logObject record = new logObject(tokens[0], tokens[1], tokens[2],tokens[3], tokens[4], tokens[5], tokens[6], tokens[7], "", tokens[8]);
                toBeProcessed[toBeProcessedArraySz] = record;
                toBeProcessedArraySz++;
                if(readToValidate == toBeProcessedArraySz){

                    try {
                        semaphore.acquire();
                    } catch (InterruptedException e) {
                        e.printStackTrace(System.out);
                    }
                    //create thread to process the read records
                    ValidationThread newVThread = new ValidationThread(props,toBeProcessed, updateStats, initCnt, semaphore, finalResults, staleSeqSemaphore, staleSeqTracker, seqTracker, seenSeqSemaphore, toBeProcessedArraySz, freshnessBuckets,bucketDuration);
                    vThreads.add(newVThread);
                    toBeProcessedArraySz = 0;
                    toBeProcessed = new logObject[readToValidate];
                    semaphore.release();
                    newVThread.start();
                }                       
            }
            br.close();//remove to test
            fstream.close();                
        }

    }catch(Exception e){
        e.printStackTrace(System.out);
    }
Struggler
    Have a look at http://stackoverflow.com/questions/1565388/increase-heap-size-in-java, but that's really putting off the problem. Why do you need all the content of all the files in memory at once? – Tony Hopkinson Oct 12 '13 at 11:22
    Reading this code hurts on so many levels... You should take a look at the ExecutorService (a minimal sketch follows after these comments); it will solve many issues for you and just works out of the box. There is no need to reinvent existing functionality. – TwoThe Oct 12 '13 at 11:24
  • @TonyHopkinson: I am currently trying to do it in pieces, because, as you pointed out, having them all in memory at once isn't exactly needed - and that works. But I would still like to know an answer to this problem from a learning perspective. – Struggler Oct 13 '13 at 21:57
  • @TwoThe: Thanks. I am looking it up right now! And yes, I know the code needs to be in better shape, but I don't think the researchers were interested in maintaining code. – Struggler Oct 13 '13 at 22:02
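
To illustrate the ExecutorService suggestion from the comments, here is a minimal sketch of a bounded thread pool; the class name, pool size and queue capacity are illustrative assumptions, not part of the original code:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class ValidationPool {

        // Fixed-size pool with a bounded work queue. When the queue is full,
        // CallerRunsPolicy makes the submitting (reading) thread run the task
        // itself, which naturally slows the reader down instead of piling up work.
        public static ExecutorService newBoundedPool(int workers, int queueCapacity) {
            return new ThreadPoolExecutor(
                    workers, workers,
                    0L, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<Runnable>(queueCapacity),
                    new ThreadPoolExecutor.CallerRunsPolicy());
        }
    }

The reading loop would then submit each batch of records to such a pool instead of starting a new ValidationThread every time, and call shutdown()/awaitTermination() once all files are read.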

2 Answers


Try starting the JVM with a larger heap space, i.e. call java -Xmx1g yourProgram. It is hard to tell from a code snippet alone why the program is running out of memory. You can also use a profiler tool such as Eclipse MAT to see exactly which objects fill up the memory.

Random42
  • It's already 1G. The people at the lab use 6G (it's a bigger dataset there) and still encounter the issue intermittently. – Struggler Oct 15 '13 at 09:45

Do not simply increase the heap size if you do not understand the problem. Increasing the heap size does not solve your problem; it only postpones it until it is worse (takes longer to occur).

The problem is that your program does not pause reading data when the heap is full. This is a simple problem: there is nothing in your algorithm which stops the reading thread from filling the heap further and further. If the processing threads cannot keep up with the reading speed, the OOME must occur at some point. You have to change this: make the reading thread pause when a maximum number of processing threads are active, and resume reading once the number of processing threads drops below that threshold again.
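
A minimal sketch of one way to add such back-pressure, using a Semaphore to cap the number of batches being processed at once; the limit, the submitBatch/process names and the plain Thread usage are illustrative assumptions (logObject is the record class from the question):

    import java.util.concurrent.Semaphore;

    public class ThrottledReader {

        // At most MAX_IN_FLIGHT batches may be processed concurrently; acquire()
        // blocks the reading thread until a processing thread releases a permit.
        private static final int MAX_IN_FLIGHT = 4; // illustrative value
        private final Semaphore inFlight = new Semaphore(MAX_IN_FLIGHT);

        void submitBatch(final logObject[] batch) throws InterruptedException {
            inFlight.acquire();                 // pauses reading when the limit is reached
            new Thread(new Runnable() {
                public void run() {
                    try {
                        process(batch);         // placeholder for the validation work
                    } finally {
                        inFlight.release();     // lets the reader continue once a batch is done
                    }
                }
            }).start();
        }

        void process(logObject[] batch) {
            // validation logic goes here
        }
    }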

Moreover: maybe one of your files is corrupted and contains a very long line, e.g. > 500 MB on a single line. Find out whether the OOME always occurs at the same line (this is very likely the case) and then check that line. What line delimiter does it have at the end, \n or \r\n? Or \r?
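
A small sketch of how one could check for such an overly long line without risking the heap; scanning byte by byte and taking the file name as a command-line argument are just one way to do this check:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class LongestLineCheck {

        public static void main(String[] args) throws IOException {
            long curLen = 0, maxLen = 0, lineNo = 1, maxLine = 1;
            // Read byte by byte so even a 500 MB line never has to fit in memory.
            try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
                int b;
                while ((b = in.read()) != -1) {
                    if (b == '\n') {                // also ends \r\n-terminated lines
                        if (curLen > maxLen) { maxLen = curLen; maxLine = lineNo; }
                        curLen = 0;
                        lineNo++;
                    } else if (b != '\r') {
                        curLen++;
                    }
                }
            }
            if (curLen > maxLen) { maxLen = curLen; maxLine = lineNo; }
            System.out.println("Longest line: #" + maxLine + " (" + maxLen + " chars)");
        }
    }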

Daniel S.
  • I agree with your solution, it seems simple and logical. I think it should work. There is no corrupted line in the files; I know so because the OOME doesn't always occur after a fixed number of operations (I am printing stuff for debugging..). – Struggler Oct 15 '13 at 09:42