10

In my Java code, I am using Guava's Multimap (com.google.common.collect.Multimap) by using this:

 Multimap<Integer, Integer> Index = HashMultimap.create()

Here, Multimap key is some portion of a URL and value is another portion of the URL (converted into an integer). Now, I assign my JVM 2560 Mb (2.5 GB) heap space (by using Xmx and Xms). However, it can only store 9 millions of such (key,value) pairs of integers (approx 10 million). But, theoretically (according to memory occupied by int) it should store more.

Can anybody help me,

  1. Why is Multimap using lots of memory? I checked my code and without inserting pairs into the Multimap, it only uses 1/2 MB of memory.
  2. 2.

Is there another way or home-baked solution to solve this memory issue? Means, Is there any way to reduce those object overheads as I want to store only int-int? In any other language ? Or any other solution (home-baked preferred) to solve issue I faced, means DB based or something like that solution.

Arpssss
  • 3,850
  • 6
  • 36
  • 80
  • 5
    Did you take object overhead into account when working out the "theoretical" size? You're using Integer, not int... – Jon Skeet Mar 29 '12 at 17:28
  • @JonSkeet, I am really sorry. Thanks. – Arpssss Mar 29 '12 at 17:30
  • No problem - and of course you *can't* use `int` as a generic type argument in Java. You probably want a dedicated int to int multimap to avoid the overhead :( – Jon Skeet Mar 29 '12 at 17:31
  • @JonSkeet, Is there any way to reduce those object overheads as I want to store only int-int? In any other language ? Or any other solution (home-baked preferred) to solve issue I faced means DB based or something like that solution. – Arpssss Mar 29 '12 at 17:56
  • In a CLR language, you'd be able to write the equivalent of Multimap in a generic form without the overhead, yes. I would imagine the same is true for various "native" languages too. In the JVM you'd have to have a specific type just for `int` to `int`. – Jon Skeet Mar 29 '12 at 17:59
  • 1
    You may also find this of interest: http://code.google.com/p/memory-measurer/wiki/ElementCostInDataStructures – Dimitris Andreou Apr 02 '12 at 19:44

4 Answers4

11

There's a huge amount of overhead associated with Multimap. At a minimum:

  • Each key and value is an Integer object, which (at a minimum) doubles the storage requirements of each int value.
  • Each unique key value in the HashMultimap is associated with a Collection of values (according to the source, the Collection is a Hashset).
  • Each Hashset is created with default space for 8 values.

So each key/value pair requires (at a minimum) perhaps an order of magnitude more space than you might expect for two int values. (Somewhat less when multiple values are stored under a single key.) I would expect 10 million key/value pairs to take perhaps 400MB.

Although you have 2.5GB of heap space, I wouldn't be all that surprised if that's not enough. The above estimate is, I think, on the low side. Plus, it only accounts for how much is needed to store the map once it is built. As the map grows, the table needs to be reallocated and rehashed, which temporarily at least doubles the amount of space used. Finally, all this assumes that int values and object references require 4 bytes. If the JVM is using 64-bit addressing, the byte count probably doubles.

Ted Hopp
  • 232,168
  • 48
  • 399
  • 521
  • Thanks. I got it. But is there any way to reduce those above listed overheads as I want to store only int-int? In any other language ? Or any other solution (home-baked preferred) to solve issue I faced means DB based or something like that solution. – Arpssss Mar 29 '12 at 17:55
  • @Arpssss There's [this list maintained by NIST](http://math.nist.gov/javanumerics/) of Java numerics libraries. Many of them support primitive collections, but I don't know of any that support multimaps. According to some studies (e.g., [this one](http://b010.blogspot.com/2009/05/speed-comparison-of-1-javas-built-in.html)), these libraries are vastly faster, but don't save all that much space. Take a look also at [this thread](http://stackoverflow.com/questions/3307622/java-primitive-collections-library). – Ted Hopp Mar 29 '12 at 18:07
  • FYI, I'm on the Guava team and working on a long-term project to reduce the large memory overhead of e.g. `HashMultimap`. So it'll get improved in the future. That said, `HashSet` on its own has _huge_ memory overhead. – Louis Wasserman Mar 29 '12 at 18:15
  • @LouisWasserman Since Hashset is part of the JFC, is the plan to simply avoid using it? It would be interesting to see what the alternative would be. – Ted Hopp Mar 29 '12 at 18:39
  • 2
    `HashSet` as written is a well-known memory hog that just internally delegates to a `HashMap`, with all the extra memory overhead of having a value attached to each key. I'm doing a Guava-internal rewrite, basically. – Louis Wasserman Mar 29 '12 at 19:24
  • "Each Hashset is created with default space for 8 values.". Note that you could reduce this amount by using the alternative factory method: `HashMultimap.create(int expectedKeys, int expectedValuesPerKey)` – Etienne Neveu Mar 30 '12 at 08:26
  • 2
    Louis, surprising as it may sound, there is no extra overhead due to the unnecessary "value" field per entry, due to alignment. (I've already tried to copy HashMap's code, drop "value", and make a HashSet, and the memory footprint was exactly the same, modulo a constant handful of words) – Dimitris Andreou Apr 02 '12 at 19:48
  • 2
    @eneveu, fyi in the next release, the default of 8 becomes a default of 2! – Dimitris Andreou Apr 02 '12 at 19:50
5

Probably the simplest way to minimize the memory overhead would be to potentially mix Trove's primitive collection implementations (to avoid memory overhead of boxing) and Guava's Multimap, something like

SetMultimap<Integer, Integer> multimap = Multimaps.newSetMultimap(
  TDecorators.wrap(TIntObjectHashMap<Collection<Integer>>()),
  new Supplier<Set<Integer>>() {
    public Set<Integer> get() {
      return TDecorators.wrap(new TIntHashSet());
    }
  });

That still has the overhead of boxing and unboxing on queries, but the memory it consumes just sitting there would be significantly reduced.

Louis Wasserman
  • 191,574
  • 25
  • 345
  • 413
  • Nice way to solve. However, Is not it create performance issue ? – Arpssss Mar 29 '12 at 19:52
  • It shouldn't create any more of a performance issue than you would experience with a normal `Multimap`, and would involve _much, much_ less of a performance penalty than your proposed alternate file-backed solution. – Louis Wasserman Mar 29 '12 at 21:52
  • 1
    +1 for Trove. Note that the code could be shortened thanks to the [TDecorators](http://trove4j.sourceforge.net/javadocs/gnu/trove/TDecorators.html) static factory methods. Thanks to type inference, `new TIntObjectMapDecorator>(new TIntObjectHashMap>())` becomes `TDecorators.wrap(new TIntObjectHashMap>())`. – Etienne Neveu Apr 01 '12 at 00:23
1

It sounds like you need a sparse boolean matrix. Sparse matrices / arrays in Java should provide pointers to library code. Then instead of putting (i, j) into the multimap, just put a 1 into the matrix at [i][j].

Community
  • 1
  • 1
Mike Samuel
  • 118,113
  • 30
  • 216
  • 245
  • Yeesh. With this scheme, it would be expensive to return all the values for a given key. You'd have to scan an entire matrix row and collect the indices of the columns containing 1. – Ted Hopp Mar 29 '12 at 18:09
  • 2
    @TedHopp, no. That's true in a dense matrix but not in a sparse matrix. A good sparse matrix class should have a way to enumerate the non-zero cells. In scipy you use [`nonzero`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.nonzero.html#scipy-sparse-lil-matrix-nonzero) for this. – Mike Samuel Mar 29 '12 at 18:11
  • Point taken in principle -- there's probably a way to use a row slice view of a sparse matrix to get the values. Then something like `nonzero` could retrieve the desired list of values. However, the Java libraries that I know of don't have analogues to `nonzero`. – Ted Hopp Mar 29 '12 at 18:51
  • @TedHopp, [Colt](http://acs.lbl.gov/software/colt/api/cern/colt/matrix/impl/SparseDoubleMatrix2D.html#forEachNonZero%28cern.colt.function.IntIntDoubleFunction%29) does. – Mike Samuel Mar 30 '12 at 13:18
0

You could use probably an ArrayListMultimap, which requires less memory than a HashMultimap, since ArrayLists are smaller than HashSets. Or, you could modify Louis's Trove solution, replacing the Set with a List, to reduce memory usage further.

Some applications depend on the fact that HashMultimap satisfies the SetMultimap interface, but most don't.

Jared Levy
  • 1,986
  • 13
  • 12