
I am evaluating data from a text file in a rather large algorithm.

If the text file contains more than a certain number of datapoints (the minimum I need is something like 1.3 million datapoints), it gives the following error:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
   at java.util.regex.Matcher.<init>(Unknown Source)
   at java.util.regex.Pattern.matcher(Unknown Source)
   at java.lang.String.replaceAll(Unknown Source)
   at java.util.Scanner.processFloatToken(Unknown Source)
   at java.util.Scanner.nextDouble(Unknown Source)

This happens when I run it in Eclipse with the following settings for the installed JRE 6 (standard VM):

-Xms20m -Xmx1024m -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40 -XX:NewSize=10m 
-XX:MaxNewSize=10m -XX:SurvivorRatio=6 -XX:TargetSurvivorRatio=80 
-XX:+CMSClassUnloadingEnabled

Note that it works fine if I only run through part of the text file.

Now, I've read a lot about this subject, and it seems that somewhere I must either have a memory leak or be storing too much data in arrays (which I think I do).

My problem is: how can I work around this? Is it possible to change my settings so that I can still perform the computation, or do I really need more computational power?
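For context, the values are read in with a Scanner, roughly like this (a simplified sketch, not my exact code; the file name is made up):

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;

    public class ReadDatapoints {
        public static void main(String[] args) throws FileNotFoundException {
            Scanner scanner = new Scanner(new File("datapoints.txt"));
            List<Double> values = new ArrayList<Double>();
            while (scanner.hasNextDouble()) {
                // each nextDouble() call builds a regex Matcher internally,
                // which is where the stack trace above points
                values.add(scanner.nextDouble());
            }
            scanner.close();
            System.out.println("Read " + values.size() + " datapoints");
        }
    }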

Jean-Paul
  • How can we be certain you know exactly what it means? All we have is that you *think* so. – Marko Topolnik May 31 '13 at 20:12
  • I read this: http://stackoverflow.com/questions/1393486/what-does-the-error-message-java-lang-outofmemoryerror-gc-overhead-limit-excee – Jean-Paul May 31 '13 at 20:13
  • 2
    I think you shall be requiring the services of a profiler for this. I especially recommend visualgc. – Marko Topolnik May 31 '13 at 20:19
  • What does a profiler do, exactly? I've never used one before. – Jean-Paul May 31 '13 at 20:20
  • 1
    Specifically, visualgc visualizes in real time all the heap generations. You see exactly and intuitively what's going on with every aspect of allocation and GC. It allows you to quickly formulate hypotheses about what may be going wrong. – Marko Topolnik May 31 '13 at 20:26
  • Do you have some sample, huge data? This is quite an interesting problem, but duplicating the data for testing a solution can be a problem... – fge May 31 '13 at 20:42
  • @fge: I totally agree. I do have sample data but I will have to alter it a bit to keep our research protected. I will have a go at it today. – Jean-Paul Jun 01 '13 at 09:16

3 Answers


The really critical VM arg is -Xmx1024m, which tells the VM to use up to 1024 megabytes of memory. The simplest solution is to use a bigger number there. You can try -Xmx2048m or -Xmx4096m, or any number, assuming you have enough RAM in your machine to handle it.

I'm not sure you're getting much benefit out of any of the other VM args. For the most part, if you tell Java how much space to use, it will be smart with the rest of the params. I'd suggest removing everything except the -Xmx param and seeing how that performs.
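To double-check what limit the VM actually picked up, you can print it from inside your program (a tiny sketch I'm adding here, not code from the question):

    public class HeapCheck {
        public static void main(String[] args) {
            // maxMemory() reflects the -Xmx limit the VM is running with
            long maxBytes = Runtime.getRuntime().maxMemory();
            System.out.println("Max heap: " + (maxBytes / (1024 * 1024)) + " MB");
        }
    }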

A better solution is to try to improve your algorithm, but I haven't yet read through it in enough detail to offer any suggestions.

Eric Grunzke
  • That seems to make sense. So I have about 4 GB of RAM. Does that mean I should be able to increase -Xmx to about 2048m? I will try it tomorrow and let you know if it worked. (It's evening here.) – Jean-Paul May 31 '13 at 20:19
  • 2
    Correct. If you're lucky, that will be enough for your dataset and you won't need to bother with more difficult / time consuming changes. With 4GB total, you could probably get up to 3GB in your vm, though you may need to close some other programs. – Eric Grunzke May 31 '13 at 20:22
  • If it works, I will give the points to you for a very short but effective solution. – Jean-Paul May 31 '13 at 20:39
  • It worked! I didn't know that increasing the memory could solve this error; I thought it only applied to heap size errors! Thank you a lot for your answer! – Jean-Paul Jun 01 '13 at 11:31

As you say the data size is really very large: if it does not fit in one computer's memory even after using the -Xmx JVM argument, then you may want to move to cluster computing, with many computers working on your problem. For this you will have to use the Message Passing Interface (MPI).

MPJ Express is a very good implementation of MPI for Java; for languages like C/C++ there are good existing implementations such as Open MPI and MPICH2. I am not sure whether it will help you in this situation, but it will certainly help you in future projects.
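To give an idea of what that looks like, here is a minimal MPJ Express skeleton (a sketch only; the class name and the data-splitting comment are mine):

    import mpi.MPI;

    public class ParallelEval {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);                     // start the MPI runtime
            int rank = MPI.COMM_WORLD.Rank();   // id of this process
            int size = MPI.COMM_WORLD.Size();   // total number of processes
            // each process would read and evaluate its own slice of the
            // datapoints here, and the results would then be combined
            System.out.println("Process " + rank + " of " + size);
            MPI.Finalize();
        }
    }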

Sourabh Bhat

I suggest you:

  • use a profiler to minimize your memory usage. I suspect you can reduce it by a factor of 10x or more by using primitives, binary data, and more compact collections.
  • increase your memory in your machine. The last time I did back testing of hundreds of signals I had 256 GB of main memory and this was barely enough at times. The more memory you can get the better.
  • use memory mapped files to increase memory efficiency (see the sketch after this list).
  • reduce the size of your data set to something your machine and program can support.
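For the memory mapped files point, a minimal sketch: assuming the datapoints were converted once into a binary file of raw doubles (the file name here is made up), you can map the file and read primitive doubles directly, with no parsing and no Double boxing:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MappedDoubles {
        public static void main(String[] args) throws Exception {
            RandomAccessFile raf = new RandomAccessFile("datapoints.bin", "r");
            FileChannel channel = raf.getChannel();
            // map the whole file; the OS pages it in as needed
            MappedByteBuffer buf = channel.map(
                    FileChannel.MapMode.READ_ONLY, 0, channel.size());
            double sum = 0;
            while (buf.remaining() >= 8) {   // 8 bytes per double
                sum += buf.getDouble();      // primitive read, no boxing
            }
            System.out.println("Sum of datapoints: " + sum);
            channel.close();
            raf.close();
        }
    }

Note that a single mapping is limited to about 2 GB, so a truly huge file would have to be mapped in chunks.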
Peter Lawrey
  • What do you mean by '256 GB of main memory'? – Jean-Paul Jun 01 '13 at 11:33
  • The machine has 256 GB of memory and using memory mapped files I was using almost all of it. – Peter Lawrey Jun 01 '13 at 11:34
  • Wow! That must have been a very big project then. No, my biggest file (a .txt file which serves as a database) is about 70 MB, so I'm fine. I solved my problem, though, and it was simpler than I thought: I merely had to increase the max memory Eclipse was allowed to use (even though I had already put it at 1024m). I am interested in these 'memory mapped files', so I'll read into that for future usage. Thank you for your time and answer! – Jean-Paul Jun 01 '13 at 11:38