I'm using the mstor library to parse an mbox mail file. Some of the files exceed a gigabyte in size. As you can imagine, this can cause some heap space issues.
Inside a loop, each iteration retrieves one message, and the getMessage() call is where the heap allocation fails once memory runs out. If I add a call to System.gc() at the top of this loop, the program parses the large files without error, but I realize that collecting garbage 40,000 times has to be slowing the program down.
My first attempt was to guard the call with if (i % 500 == 0) System.gc() so that it happens only once every 500 records. I tried raising and lowering this number, but the results are inconsistent and the run generally ends with an OutOfMemoryError.
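For reference, the first attempt has roughly the shape below. This is a minimal sketch rather than my actual code: loadMessage is a hypothetical stand-in for inbox.getMessage(i) that just allocates a buffer, and the message count, the interval, and the heap logging through Runtime are assumptions added so that the effect of each collection can be observed.

```java
class PeriodicGcSketch {

    // Hypothetical stand-in for inbox.getMessage(i): allocates a buffer
    // so that every iteration leaves some garbage behind.
    static byte[] loadMessage(int i) {
        return new byte[64 * 1024];
    }

    // Heap currently in use, in bytes, as reported by the JVM.
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Walks messages 1..messageCount and requests a collection every
    // gcInterval messages. Returns how many messages were loaded.
    static int run(int messageCount, int gcInterval) {
        int processed = 0;
        for (int i = 1; i <= messageCount; i++) {
            if (i % gcInterval == 0) {
                // System.gc() is a request; the JVM decides how much work to do.
                System.gc();
                System.out.println("After message " + i + ": "
                        + usedHeap() / (1024 * 1024) + " MB in use");
            }
            byte[] message = loadMessage(i);
            if (message.length > 0) {
                processed++;
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println("Processed " + run(2000, 500) + " messages");
    }
}
```

Logging the used heap at each collection point should show whether memory is actually being released between batches or is still reachable from somewhere.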
My second, more clever attempt looks like this:
    try {
        message = inbox.getMessage(i);
    } catch (OutOfMemoryError e) {
        if (firstTry) {
            i--;
            firstTry = false;
        } else {
            firstTry = true;
            System.out.println("Message " + i + " skipped.");
        }
        System.gc();
        continue;
    }
The idea is to call the garbage collector only when an OutOfMemoryError is thrown, and then decrement the counter so the same message is tried again. Unfortunately, after parsing several thousand e-mails the program just starts outputting:
Message 7030 skipped.
Message 7031 skipped.
....
and so on for the rest of them.
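The retry idea can also be written as a self-contained sketch, which may make the behavior easier to reason about. The names here are hypothetical: loadWithOneRetry is an illustrative helper, and the flaky loader in main only simulates a single failure, so this is not the mstor API. Unlike the snippet above, the retry state is local to each message instead of living in a shared firstTry flag.

```java
import java.util.function.IntFunction;

class RetryOnceSketch {

    // Tries load.apply(i). On OutOfMemoryError it requests a collection and
    // tries the same index once more. Returns null if the retry also fails.
    static <T> T loadWithOneRetry(IntFunction<T> load, int i) {
        try {
            return load.apply(i);
        } catch (OutOfMemoryError first) {
            System.gc(); // a request only; the JVM may not free anything
            try {
                return load.apply(i);
            } catch (OutOfMemoryError second) {
                System.out.println("Message " + i + " skipped.");
                return null;
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical loader: fails on the first call, succeeds afterwards.
        IntFunction<String> flaky = n -> {
            if (calls[0]++ == 0) {
                throw new OutOfMemoryError("simulated");
            }
            return "message " + n;
        };
        System.out.println(loadWithOneRetry(flaky, 1));
    }
}
```

If every message after a certain point is skipped even with this structure, the collector is finding nothing to free, which would suggest the memory is still referenced rather than being garbage.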
I'm confused about why calling the collector on every iteration would give different results from this. From my understanding, garbage is garbage, and all this should be changing is how much is collected at a given time.
Can anyone explain this odd behavior? Does anyone have recommendations for other ways to call the collector less frequently? My heap space is maxed out.