Issue with Hash Map Space

Question

In my Java code, I am using Guava's Multimap (com.google.common.collect.Multimap) by using this:

 Multimap<Integer, Integer> Index = HashMultimap.create()

Here, Multimap key is some portion of a URL and value is another portion of the URL (converted into an integer). Now, I assign my JVM 2560 Mb (2.5 GB) heap space (by using Xmx and Xms). However, it can only store 9 millions of such (key,value) pairs of integers (approx 10 million). Now, issue is, I can provide JVM only limited amount of memory (say 2 GB).

So, can anybody help me,

1) Is there another way or home-baked solution to solve this memory issue? Means, Is Disk/DB Based Multi-Map can be a nice solution ? I read from some web articles that there is some DB/Disk based solution to solve this issue ex. Berkley DB or Ehcache. Can anybody inform me whether (or which one) is faster ?

2) Is those Disk/DB Based Multi-Map has performance issue (I am asking for both storing and searching) ?

3) Any idea or information how to use those in brief.

4) Any other idea will be nice for me.

NB: I want Multimap (key can have multiple values)solutions for the above issue. And I have to consider performance of storing and searching also.

May I ask why you want to do this? For that many items, you can use a plain relational database, with an index configured on your key column. — vgru, Mar 29 '12 at 18:54
@Groo, I have more than 100 millions of key value pairs. And I want a nice fast way to store and search. — Arpssss, Mar 29 '12 at 18:57
FYI, I suggested an answer to your original question that might let you keep using Guava's `Multimap` with reduced space overhead. — Louis Wasserman, Mar 29 '12 at 19:33
@Arpssss: if you put them in a db table, and create an index on the key column, seeking a row will be very fast. For rows as small as this, you can probably make your key column a primary key column with a clustered index, to achieve O(1) seek time. — vgru, Mar 29 '12 at 20:48

score 2 · Accepted Answer · answered Mar 29 '12 at 20:05

2

You certainly won't store 100 million pairs of Integer objects in 2.5 GB of memory. If I'm not mistaken, an Integer will use at least 16 bytes of memory in Oracle/Sun JVM (and the alignment is also 16 bytes), which means 3.2 GB of memory for the Integers alone, without any structure.

With this data size you should definitely go with something which is backed by the disk, or use a server with lots of memory and/or optimized data structures (in particular try to avoid primitive type wrappers). I have used H2 for similar tasks and found it quite good (it can use mapped files to access the disk instead of reads), but I don't have any comparison with other similar libraries.

answered Mar 29 '12 at 20:05

Michał Kosmulski

9,855
1
32
51

Thanks. But, can it be used to store single key with multiple values ? Can you little bit elaborate your answer by providing simple code for how to use it ? From your experience, is it faster ? – Arpssss Mar 29 '12 at 20:09
The API is via JDBC (there's an alternative, faster API also if you need a huge number of transactions per second). So, essentially, you code like for an SQL database, meaning multiple values need to either be represented by relations and multiple tables or somehow encoded into a single value (which is usually less elegant). As for speed, I haven't compared it against competition, other factors were crucial. It will certainly be much slower than an in-memory map. You could look for specialized structures or try to roll your own, based e.g. on Trove (very good, but regular maps, no multimap). – Michał Kosmulski Mar 29 '12 at 20:22
Small addition on the 16 bytes per Integer, as rightly explained above: Given the amount of data you're talking about, you'll probably end up with a 64-bit VM. And an Integer would actually use 24 bytes. Reason being the Object header already requires 2 word (2 x 64 bits) and then the int (32 bits), but given how memory is increased and object alignment, you end up with 24 bytes... Object alignment being 8 bits as far as I know on HotSpot. (16 on JRockit 64 bits with 64GB compressed refs?). Anyways, anything between 3 and 4.5 GB for 200 millions Integer without any structure to contain them! – Alex Snaps Apr 03 '12 at 15:52

Andrejs · Answer 2 · 2012-03-31T08:05:48.510

2

JDBM3 is a very fast on-disk HashMap/TreeMap (B+Tree) library and is claimed to be 4x faster than berkeley db. Billions of records can be stored in the map. It does caching internally so map operations won't be slowing down because of disk access.

DB db = DBMaker.openFile(fileName).make();
Map<Integer,Integer> map = db.createHashMap("mapName");
map.put(5, 10);
db.close()

It does not have a Multimap but the value can be a Set/List.

edited Mar 31 '12 at 08:05

answered Mar 30 '12 at 19:34

Andrejs

26,885
12
107
96

Thanks. Is it faster than Kyoto cabinet ? Is it nice for large database (like for billions ) ? – Arpssss Mar 31 '12 at 00:01
Another point is that: is any other structure for it (like B Tree or like that R Tree) which allows duplicates means keys with multiple values ? – Arpssss Mar 31 '12 at 00:03
It does have a B+Tree structure with multiple keys stored on single tree node and supports billions of records. The website says that is slower than Tokyo cabinet but its probably the fastest pure hava solution – Andrejs Mar 31 '12 at 08:26

Issue with Hash Map Space

2 Answers2

Linked