0

I'm using the mstor library to parse an mbox mail file. Some of the files exceed a gigabyte in size. As you can imagine, this can cause some heap space issues.

There's a loop that, for each iteration, retrieves a particular message. The getMessage() call is what is trying to allocate heap space when it runs out. If I add a call to System.gc() at the top of this loop, the program parses the large files without error, but I realize that collecting garbage 40,000 times has to be slowing the program down.

My first attempt was to write the call as if (i % 500 == 0) System.gc() so that it happens every 500 records. I tried raising and lowering this number, but the results are inconsistent and the run generally ends in an OutOfMemoryError.

My second, more clever attempt looks like this:

try {
    message = inbox.getMessage(i);
} catch (OutOfMemoryError e) {
    if (firstTry) {
        i--;
        firstTry = false;
    } else {
        firstTry = true;
        System.out.println("Message " + i + " skipped.");
    }
    System.gc();
    continue;
}

The idea is to only call the garbage collector if an OutOfMemory error is thrown, and then decrement the count to try again. Unfortunately, after parsing several thousand e-mails the program just starts outputting:

 Message 7030 skipped.
 Message 7031 skipped.
 ....

and so on for the rest of them.

I'm just confused as to how hitting the collector for each iteration would return different results than this. From my understanding, garbage is garbage, and all this should be changing is how much is collected at a given time.

Can anyone explain this odd behavior? Does anyone have recommendations for other ways to call the collector less frequently? My heap space is maxed out.

trincot
Jacob Ensor
  • 3
    Have you tried increasing your heap space? – Vivin Paliath Jun 21 '12 at 16:54
  • @VivinPaliath His heap space is maxed out is the last sentence of the question... – fvu Jun 21 '12 at 16:56
  • 1
    @fvu You can increase the heap space using `-Xmx`. Unless he means that his machine doesn't have enough RAM to do that. – Vivin Paliath Jun 21 '12 at 16:58
  • Do you need to read the entire file at once? – John Kane Jun 21 '12 at 16:58
  • 2
    Show more of your code, you shouldn't have to call GC at all. – Robin Jun 21 '12 at 17:01
  • Note that catching errors in Java is generally bad practice. – arshajii Jun 21 '12 at 17:03
  • Really? In Java there are checked exceptions which you are forced to catch (it's annoying... but built into the language). Though it's better to try to avoid code errors when possible, if that's what you meant. – John Kane Jun 21 '12 at 17:04
  • What are you doing with the parsed message in your loop? This could be part of the issue. – John Kane Jun 21 '12 at 17:06
  • @JohnKane `Error` was created so that they wouldn't be swallowed when you did a pokemon-style catch, i.e., `catch(Exception e)`, since that catches unchecked exceptions also. – Vivin Paliath Jun 21 '12 at 17:48
  • @Vivin Paliath when you say Error was created do you mean Exception? – John Kane Jun 21 '12 at 17:58
  • @JohnKane `Exception` is different from [`Error`](http://docs.oracle.com/javase/6/docs/api/java/lang/Error.html). It's a subclass of `Throwable` and generally indicates a serious fault or condition that the application is not expected to catch. – Vivin Paliath Jun 21 '12 at 18:00
  • @Vivin Paliath I know that Error is Different from Exception. Sorry I read the comment about catching errors as exceptions (I need to sleep more). – John Kane Jun 21 '12 at 18:03

5 Answers

1

You should not rely on System.gc(), as the VM can ignore it. If you get an OutOfMemoryError, the VM has already tried to run the GC. You can try increasing the heap size, changing the sizes of the heap generations (if most of your objects end up in the old generation, you don't need much memory for the young generation), or reviewing your code to make sure you are not holding references to resources you no longer need.

Andy
1

Calling System.gc() is generally a waste of time: it is not guaranteed to do anything at any particular time, it is a suggestion at best, and the JVM is free to ignore it. Calling it after an OutOfMemoryError is even more useless, because the JVM has already tried to reclaim memory before the error was thrown.

The only thing you can do if you are using third party code you can't control is increase the JVM heap allocation at the command line to the most that your particular machine can handle.

Get started with java JVM memory (heap, stack, -xss -xms -xmx -xmn...)
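To confirm that a larger heap setting actually reached the JVM, the limits can be printed from inside the program. This is a minimal sketch using only java.lang.Runtime; the class name HeapReport is made up for illustration.

```java
final class HeapReport {
    private static final long MIB = 1024L * 1024L;

    /** Maximum heap the JVM will attempt to use, in MiB (this is what -Xmx controls). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    /** Heap currently in use, in MiB: the committed heap minus its free part. */
    static long usedHeapMiB() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MIB;
    }

    public static void main(String[] args) {
        System.out.println("max heap:  " + maxHeapMiB() + " MiB");
        System.out.println("used heap: " + usedHeapMiB() + " MiB");
    }
}
```

Running it as, say, java -Xmx2g HeapReport (2g is only an example value) should report a maximum close to the requested size. The exact figure can differ slightly depending on the collector in use.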

1

Here are my suggestions:

  • Increase heap space. This is probably the easiest thing to do. You can do this with the -Xmx parameter.
  • See if the API to load messages provides a "streaming" option. Perhaps you don't need to load the entire message into memory at once.

Calling System.gc() won't do you any good because it doesn't guarantee that the GC will be called. In effect, it is a sure sign of bad code. If you're depending on System.gc() for your code to work, then your code is probably broken. In this case you seem to be relying on it for performance's sake and that is a sign that your code is definitely broken.

You can never be sure that the JVM will honor your request, and you can't tell how it will perform the garbage collection either. The JVM may decide to ignore your request completely (i.e., it is not a guarantee). Whether System.gc() will do what it's supposed to, is pretty iffy. Since its behavior is not guaranteed, it is better to not use it altogether.

Finally, explicit calls to System.gc() can be disabled with the -XX:+DisableExplicitGC option, so again there is no guarantee that your System.gc() call will run: the JVM may have been configured to ignore it.
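If you want to know whether the JVM you are running on was started with explicit GC disabled, one rough way is to inspect its startup arguments. The sketch below assumes a HotSpot-style flag spelling (-XX:+DisableExplicitGC to enable, -XX:-DisableExplicitGC to disable) and only sees the arguments reported by the runtime MXBean; the class name ExplicitGcCheck is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class ExplicitGcCheck {
    /**
     * Returns true if the argument list turns explicit GC off.
     * When the flag appears more than once, the last occurrence wins.
     */
    static boolean explicitGcDisabled(List<String> jvmArgs) {
        boolean disabled = false;
        for (String arg : jvmArgs) {
            if (arg.equals("-XX:+DisableExplicitGC")) {
                disabled = true;
            } else if (arg.equals("-XX:-DisableExplicitGC")) {
                disabled = false;
            }
        }
        return disabled;
    }

    public static void main(String[] args) {
        // JVM options of the running process; excludes the arguments passed to main().
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("System.gc() disabled by flag: " + explicitGcDisabled(jvmArgs));
    }
}
```

Even when the flag is absent, a System.gc() call remains a request rather than a command, so this check only rules out one way the call can be ignored.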

Vivin Paliath
  • 1
    Your post was the most motivating, thanks for pointing me in the right direction! See my answer for the final solution. – Jacob Ensor Jun 22 '12 at 20:45
1

By default mstor will cache messages retrieved from a folder in an ehcache cache for faster access. This caching may be disabled however, and I would recommend disabling it for large folders.

You can disable caching by creating a text file called 'mstor.properties' in the root of your classpath with the following content:

mstor.cache.disabled=true

You can also set this value as a system property:

java -Dmstor.cache.disabled=true SomeProgram
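The same system property can presumably also be set from code, as long as that happens before mstor is first used. That timing requirement is an assumption about when the library reads the property, not something stated above. A minimal sketch, with a hypothetical class name MstorCacheConfig:

```java
final class MstorCacheConfig {
    static final String CACHE_DISABLED_KEY = "mstor.cache.disabled";

    /**
     * Sets the property to "true" unless a value was already supplied,
     * for example with -D on the command line.
     */
    static void disableCacheIfUnset() {
        if (System.getProperty(CACHE_DISABLED_KEY) == null) {
            System.setProperty(CACHE_DISABLED_KEY, "true");
        }
    }

    public static void main(String[] args) {
        // Assumption: this runs before any mail session or store is created.
        disableCacheIfUnset();
        System.out.println(CACHE_DISABLED_KEY + "=" + System.getProperty(CACHE_DISABLED_KEY));
    }
}
```

Leaving an existing value untouched keeps the command-line form authoritative, so the cache can still be re-enabled for a single run without recompiling.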
fortuna
0

The mstor library wasn't handling the caching of messages well. After doing some research I found that if you call Folder.close() (inbox is my folder object above), mstor and JavaMail release all of the messages that were cached as a result of the getMessage() calls.

I made the try/catch block look like this:

try {
    message = inbox.getMessage(i);
    // moved all of my calls to message.getFrom(),
    // message.getAllRecipients(), etc. inside this try/catch.
} catch (OutOfMemoryError e) {
    if (firstTry) {
        i--;
        firstTry = false;
    } else {
        firstTry = true;
        System.out.println("Message " + i + " skipped.");
    }
    inbox.close(false);
    System.gc();
    inbox.open(Folder.READ_ONLY);
    continue;
}
firstTry = true;

Each time the catch statement is hit, it takes 40-50 ms to manually clear the cached messages and re-open the folder.

When calling the garbage collector on every iteration, it took 57 minutes to parse a 1.6 gigabyte file. With this logic, it takes only 18 minutes to parse the same file.
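The retry-once pattern above can also be written without the firstTry flag and the i-- adjustment. The following is only a sketch of that idea, not mstor or JavaMail API: RetryOnce, fetchWithOneRetry, fetch and release are hypothetical names, and in the real loop release would wrap the inbox.close(false) and inbox.open(Folder.READ_ONLY) calls, including handling of their checked exceptions.

```java
import java.util.function.IntFunction;

final class RetryOnce {
    /**
     * Tries fetch.apply(index). If that throws OutOfMemoryError, runs release
     * and tries exactly once more. Returns null if the second attempt also
     * fails, which the caller can treat as "message skipped".
     */
    static <T> T fetchWithOneRetry(int index, IntFunction<T> fetch, Runnable release) {
        try {
            return fetch.apply(index);
        } catch (OutOfMemoryError first) {
            release.run(); // free cached data before the second attempt
            try {
                return fetch.apply(index);
            } catch (OutOfMemoryError second) {
                release.run(); // leave the heap as clean as possible for the next index
                return null;
            }
        }
    }
}
```

Because each call handles its own retry, a success on one message cannot leave stale state behind for the next one, which is the problem the trailing firstTry = true line works around. Catching OutOfMemoryError remains a last resort, as the comments on this answer point out.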

Update - Another important way to lower the amount of memory used by mstor is through its cache properties. Somebody else already mentioned setting "mstor.cache.disabled" to true, and this helped. Today I discovered another important property that greatly reduced the number of OOM catches, even for larger files.

    Properties props = new Properties();
    props.setProperty("mstor.mbox.metadataStrategy", "none");
    props.setProperty("mstor.cache.disabled", "true");
    props.setProperty("mstor.mbox.cacheBuffers", "false");   // most important
Jacob Ensor
  • 1
    Hmm, you shouldn't be catching `OutOfMemoryError` and making an explicit call to `System.gc()`. Those are very bad code-smells. There should be a way in mstor to control the caching behavior. – Vivin Paliath Jun 22 '12 at 21:20
  • I've tweaked the mstor caching properties to no avail. I could easily have the folders open and close every 500 or 1000 iterations, but I imagine that would be slower than this. I would think this is a good exception to the rule. If there weren't exceptions, we wouldn't have access to things like the garbage collector. This method ensures I only do the resource-intensive opening and closing of the Folder object as needed. With smaller files, it won't even be called. Thoughts? – Jacob Ensor Jun 22 '12 at 22:17
  • That seems pretty odd. I wonder if it's a limitation of the API then. I guess you could always email the mstor guys and ask them the best way to handle this kind of situation. – Vivin Paliath Jun 22 '12 at 23:01
  • Solid idea, will do. I'll follow up if I hear anything back. Thanks for your help man! – Jacob Ensor Jun 23 '12 at 02:16
  • The developer's e-mails aren't listed, but I added a topic to their SourceForge help forum. Here's the link if you're interested to follow (though they may just post in here): https://sourceforge.net/projects/mstor/forums/forum/390660/topic/5375846 – Jacob Ensor Jun 23 '12 at 17:43
  • that `System.gc()` still isn't actually doing anything to release any memory it is just costing you time. –  Jul 03 '12 at 16:21
  • Do you mind spending some time justifying your input? "Some guy on StackOverflow said so" doesn't really do it for me. – Jacob Ensor Jul 04 '12 at 07:57
  • http://stackoverflow.com/questions/2414105/why-is-it-a-bad-practice-to-call-system-gc – avgvstvs Jul 05 '12 at 17:44
  • Go to section 31 for an admonishment from Sun to NEVER call System.gc() http://java.sun.com/docs/hotspot/gc1.4.2/faq.html – avgvstvs Jul 05 '12 at 17:56