1

My Java class reads in a 60 MB file and produces a HashMap of HashMaps with over 300 million entries.

HashMap<Integer, HashMap<Integer, Double>> pairWise =
                             new HashMap<Integer, HashMap<Integer, Double>>();

I have already tuned the VM arguments to:

-Xms512M -Xmx2048M

But it still fails with:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.HashMap.createEntry(HashMap.java:869)
    at java.util.HashMap.addEntry(HashMap.java:856)
    at java.util.HashMap.put(HashMap.java:484)
    at com.Kaggle.baseline.BaselineNew.createSimMap(BaselineNew.java:70)
    at com.Kaggle.baseline.BaselineNew.<init>(BaselineNew.java:25)
    at com.Kaggle.baseline.BaselineNew.main(BaselineNew.java:315)

How big does the heap need to be for this to run without failing with an OOME?

Bhesh Gurung
  • Trying to build a 300mio x 300mio similarity matrix is a pretty bad idea, even if you have 2gb of heap. – Thomas Jungblut Apr 25 '14 at 14:58
  • Using a simplified test I measure something like 90 bytes for a `Map.Entry`, so 300 million entries require at least 25 GB (probably considerably more than that, since you'll also have a bunch of `Map.Entry` objects and other things in memory) – SamYonnou Apr 25 '14 at 15:28

3 Answers

5

Your dataset is ridiculously large for in-memory processing, so this is not a final solution, just an optimization.

You're using boxed primitives, which is a very painful thing to look at. According to this question, a boxed integer can be 20 bytes larger than an unboxed integer. This is not what I call memory efficient.

You can optimize this with specialized collections, which don't box the primitive values. One project providing these is Trove. You could use a TIntDoubleMap instead of your HashMap<Integer, Double> and a TIntObjectHashMap instead of your HashMap<Integer, …>.

Therefore your type would look like this:

TIntObjectHashMap<TIntDoubleHashMap> pairWise =
                         new TIntObjectHashMap<TIntDoubleHashMap>();
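
For completeness, here is one way the nested Trove maps might be populated. This is a minimal sketch assuming the Trove 3.x API (package `gnu.trove.map.hash`); the `PairwiseStore` class and its `put`/`get` helpers are illustrative, not code from the question:

    // Sketch only: nested Trove maps holding primitive ints and doubles.
    import gnu.trove.map.hash.TIntDoubleHashMap;
    import gnu.trove.map.hash.TIntObjectHashMap;

    public class PairwiseStore {
        private final TIntObjectHashMap<TIntDoubleHashMap> pairWise =
                new TIntObjectHashMap<TIntDoubleHashMap>();

        public void put(int row, int col, double value) {
            TIntDoubleHashMap inner = pairWise.get(row);
            if (inner == null) {               // lazily create the inner map per row
                inner = new TIntDoubleHashMap();
                pairWise.put(row, inner);
            }
            inner.put(col, value);             // primitives throughout, no Integer/Double boxing
        }

        public double get(int row, int col) {
            TIntDoubleHashMap inner = pairWise.get(row);
            return inner == null ? 0.0 : inner.get(col); // Trove's default "no entry" value is 0
        }
    }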

Now, do the math.

300,000,000 boxed Double values at roughly 24 bytes each use 7,200,000,000 bytes of memory, that is 7.2 GB.
If you store 300,000,000 primitive doubles at 8 bytes each instead, you only need 2,400,000,000 bytes, which is 2.4 GB.
Congrats, you saved roughly two thirds of the memory you previously used just for storing your numbers!

Note that this calculation is rough, depends on the platform and implementation, and does not account for the memory used for the HashMap/T*Maps.
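
If you want to redo this back-of-the-envelope estimate with your own numbers, here is a tiny sketch (the 24-byte and 8-byte figures are the rough per-value estimates from above, not measurements):

    // Rough estimate only; actual overhead depends on the JVM and map implementation.
    public class MemoryEstimate {
        public static void main(String[] args) {
            long entries = 300_000_000L;
            double boxedGB     = entries * 24 / 1e9; // ~24 bytes per java.lang.Double object
            double primitiveGB = entries * 8  / 1e9; // 8 bytes per primitive double
            System.out.printf("boxed: %.1f GB, primitive: %.1f GB%n", boxedGB, primitiveGB);
        }
    }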

tilpner
  • Thanks, it seems like a good solution. I downloaded the jar, but it seems it can't be imported like other jar libraries? Do you know how to do this? – Xingsheng Guo Apr 25 '14 at 15:48
  • I'm not sure I understand your intention, nor do I know what environment you develop in. I assume Eclipse or IntelliJ, and in both of these it should be straightforward. I can't help you without further information. – tilpner Apr 25 '14 at 15:51
  • Sorry for the confusion. I use Eclipse. I imported the jar file trove-3.0.3-src.jar into the project via the build path as usual. But when I create a TIntObjectHashMap, there is no package to import... – Xingsheng Guo Apr 25 '14 at 15:55
  • Download [this](https://bitbucket.org/robeden/trove/downloads/trove-3.1a1.zip), unzip, `cd` there, call `ant` and the needed files are generated. You can now proceed packing these or copying these. – tilpner Apr 25 '14 at 15:58
  • Oh, it seems you can also call `ant dist` and copy the resulting `.jar` from `/output/dist/…/trove-…-src.jar`. – tilpner Apr 25 '14 at 16:00
  • Okay... It seems I need Ant for Mac first. I will try this tonight. Just to make sure: once I have the files from the ant command, I do the normal import steps in Eclipse, right? – Xingsheng Guo Apr 25 '14 at 16:07
  • I don't use Eclipse, but I think it should work like that. – tilpner Apr 25 '14 at 16:08
3

Your data set is large enough that holding all of it in memory at one time is not going to happen.

Consider storing the data in a database and loading partial data sets to perform manipulation.

Edit: My assumption was that you were going to make more than one pass over the data. If all you are doing is loading it and performing one action on each item, then Lex Webb's suggestion (comment below) is a better solution than a database. If you are performing more than one action per item, then a database appears to be the better solution. The database does not need to be an SQL database; if your data is record-oriented, a NoSQL database might be a better fit.
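
For the single-pass case, a minimal sketch of the streaming approach from the comment below; the file layout and the `handleRecord` method are hypothetical, since the question doesn't show the record format:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class SinglePass {
        public static void main(String[] args) throws IOException {
            // Read and process one record at a time instead of building the whole map in memory.
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    handleRecord(line);      // hypothetical per-record processing
                }
            }
        }

        private static void handleRecord(String line) {
            // Placeholder: parse the line and update only the aggregates you actually need.
        }
    }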

DwB
  • Alternatively, just don't read the whole file at once; if possible, read and process the information bit by bit. – Lex Webb Apr 25 '14 at 14:57
1

You are using the wrong data structures for data of this volume. Java adds significant overhead in memory and time for every object it creates, and at 300 million objects you are looking at a lot of overhead. Consider leaving this data in the file and using random-access techniques to address it in place: take a look at memory-mapped files using nio.
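
A minimal sketch of that approach with the standard java.nio API; the fixed-width layout of consecutive 8-byte doubles assumed here is purely illustrative:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MappedFileAccess {
        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
                // Map the file into memory; the OS pages it in on demand,
                // so it never has to fit in the Java heap.
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                // Illustrative only: treat the file as consecutive 8-byte doubles
                // and read the i-th value in place.
                int i = 1000;
                double value = buf.getDouble(i * 8);
                System.out.println(value);
            }
        }
    }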

Software Engineer