12

I need a very basic key-value store for java. I started with a HashMap but it seems that HashMap is somewhat space inefficient (I'm storing ~20 million records, and seems to require ~6GB RAM).

The map is Map<Integer,String>, and so I'm considering using GNU Trove TIntObjectHashMap<byte[]>, and storing the map value as an ascii byte array rather than String.

As an alternative to that, is there a key-value store that only requires adding jar files, does not hold the entire map in RAM at once, and is still reasonably fast?

Kevin
  • 24,871
  • 19
  • 102
  • 158
  • Just wanted to comment that I've been using Trove for some time and they seem to be quite reliable. I use them anytime I need a primitive key and/or value. – Erick Robertson Jul 10 '11 at 04:09
  • ehCache? Has automatic overflow to disk and is hashmap-like performance wise for what's in memory. No idea how it would cope with 20 million entries going to disk though. – Affe Jul 10 '11 at 04:10
  • Do you know what kind of space savings I can expect over HashMap for Trove? – Kevin Jul 10 '11 at 04:13

7 Answers7

8

BabuDB

BabuDB is an embedded non-relational database system. Its lean and simple design allows it to persistently store large amounts of key-value pairs without the overhead and complexity of similar approaches such as BerkeleyDB.

License: New BSD license, Language: Java

JDBM2

JDBM2 provides HashMap and TreeMap which are backed by disk storage.

License: Apache License 2.0, Language: Java

Banana DB

Banana DB is a self-contained key/value pair database implemented in Java.

License: Apache License 2.0, Language: Java


I've tried BabuDB and JDBM2 and they work fine. BabuDB is a little bit more difficult to set up, but potentially delivers higher performance than JDBM2.

These all all databases, which allow to persist data on disk. There are also solutions to hold a large map in memory (ehcache, hazelcast, ...).

mxro
  • 6,588
  • 4
  • 35
  • 38
  • BabuDB looks great, but I'm still looking for some Getting Started docs (asked the devs about it). – Stefan Reich Aug 03 '16 at 15:20
  • Any luck getting those docs? I had an implementation for [this interface](https://github.com/javadelight/delight-keyvalue) in Babu DB but cannot find it at the moment ... – mxro Aug 24 '16 at 05:36
5

Use Berkeley DB.

Berkeley DB stores object graphs, objects in collections, or simple binary key/value data directly in an a btree on disk. This simple, highly efficient approach removes all the unnecessary overhead in ORM solutions. Using the Direct Persistence Layer (DPL) Java developers annotate classes with storage information, much like JPA. This approach is familiar, efficient, and fast. The DPL reduces the complexity of data storage while not sacrificing speed.

This should definitely give you huge gains in memory and speed, while not increasing the complexity of your application. Enjoy!

hayesgm
  • 8,678
  • 1
  • 38
  • 40
  • I've heard that there are licensing issues with this, is there truth to that? – Kevin Jul 10 '11 at 04:21
  • @Kevin - that depends on what would be an issue for you / your product / your company. You can read about the licensing options at http://www.oracle.com/technetwork/database/berkeleydb/downloads/licensing-098979.html ... and make your own judgement. – Stephen C Jul 10 '11 at 04:30
  • 2
    tl;dr: You can use it free if 1) your project is open-source or 2) it's for internal use (non-redistributed). Otherwise, you need to obtain a license from Oracle (and they give no indication of the potential costs). – hayesgm Jul 10 '11 at 04:32
  • 1
    Ah, thanks. The open source license can be used for non-open source apps that are not distributed to 3rd parties. Perfect. – Kevin Jul 10 '11 at 04:33
  • 1
    Just a note, Oracle has recently changed their open source license to the Affero license: http://www.oracle.com/technetwork/products/berkeleydb/downloads/oslicense-093458.html – Mike Hedman Jul 08 '13 at 21:26
4

http://www.mapdb.org/ is what you are looking for. It's a rocking fast persistent implementation of java.util.Map.

Features

Concurrent

MapDB has record level locking and state-of-art concurrent engine. Its performance scales nearly linearly with number of cores. Data can be written by multiple parallel threads.

Fast

MapDB has outstanding performance rivaled only by native DBs. It is result of more than a decade of optimizations and rewrites.

ACID

MapDB optionally supports ACID transactions with full MVCC isolation. MapDB uses write-ahead-log or append-only store for great write durability.

Flexible

MapDB can be used everywhere from in-memory cache to multi-terabyte database. It also has number of options to trade durability for write performance. This makes it very easy to configure MapDB to exactly fit your needs.

Hackable

MapDB is component based, most features (instance cache, async writes, compression) are just class wrappers. It is very easy to introduce new functionality or component into MapDB.

SQL Like

MapDB was developed as faster alternative to SQL engine. It has number of features which makes transition from relational database easier: secondary indexes/collections, autoincremental sequential ID, joins, triggers, composite keys…

Low disk-space usage

MapDB has number of features (serialization, delta key packing…) to minimize disk used by its store. It also has very fast compression and custom serializers. We take disk-usage seriously and do not waste single byte.

Community
  • 1
  • 1
thmarx
  • 91
  • 4
1

Consider Koloboke Collections, which is up to 2 times faster than Trove according to various tests:

if configured to consume the same memory as Trove. Or alternatively, you can think it consumes considerably lesser memory if configured to be equally fast to Trove.


If you want to persist the map between JVM runs with very quick bootstrap, you might also be interested in Chronicle-Map which stores Strings in UTF-8 by default (so you shouldn't bother with conversions String <-> byte[] as with Koloboke/Trove). Chronicle-Map is ultra fast for persisted key-value store, but a bit slower that Koloboke and even Trove.

leventov
  • 14,760
  • 11
  • 69
  • 98
1

Just wanted to reference some more open source options that became available over time since this question was first asked.

Apache 2, BTree, Apache Directory Project JDBM replacement effort:

http://directory.apache.org/mavibot/

MPL2/EPL1, RTree, MVStore, H2 Storage Engine:

http://www.h2database.com/html/mvstore.html

Apache 2, Xodus Environments, JetBrains YouTrack and Hub storage engine:

https://github.com/JetBrains/xodus

Dieter
  • 1,154
  • 8
  • 7
0

The map is Map, and so I'm considering using GNU Trove TIntObjectHashMap, and storing the map value as an ascii byte array rather than String.

This doesn't entirely make sense because a TIntObjectHashMap is not a Map. However, the approach is sound.


Do you know what kind of space savings I can expect over HashMap for Trove?

The best answer is to try it out.

But here are some rough estimates (assuming a 32bit JVM):

  • HashMap keys would need to be Integer instances. They will occupy ~18bytes per instance + 4 bytes per reference. Total 24 bytes.

  • Trove keys would be 4 byte int values.

  • String values would be 20 bytes + 12 bytes + 2 * number of "characters".

  • Byte array values would be 12 bytes + 1 * number of "characters".

  • I haven't examined the details of the respective hash table internal data structures.

That probably amounts to around 50% memory saving, though it depends critically on the average length of the value "strings". (The longer they are, the more they will dominate the space usage.)

FWIW, Trove publish their own benchmarks here. They don't look very convincing, but you should be able to dig out their benchmark code and modify it to better match your use-case.

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
  • 1
    TIntObjectHashMap is not a Map in the sense that it doesn't implement java.util.Map, but it does support put/get for primitives, which is what I care about. If I read correctly it uses an open-addressing hash map implementation. Or did you mean something else when you said TIntObjectHashMap is not a Map? – Kevin Jul 10 '11 at 04:26
  • @Stephen How did you calculate the memory for HashMap? BTW: trove would require 5 bytes, using 4 for int and one for the state. – Karussell Jan 11 '12 at 21:51
  • oh, trove would require 9 bytes! using 4 for the index (int[]), one for the state (byte[]) and 4 for the reference (Object[]). – Karussell Jan 11 '12 at 22:21
  • 1
    @Karussell - I didn't calculate the memory for HashMap. I *estimated* and compared the size of the keys; i.e. `Integer` instances versus `int`s. If you want to add your own Answer, please feel free to do so. – Stephen C Jan 11 '12 at 22:21
  • @StephenC ah, ok. (No, I was just wondering how one could calculate that) – Karussell Jan 12 '12 at 13:34
0

You can use Xodus KV. That is a key-value store used in production by JetBrains in the YouTrack product. It provides snapshot isolation with readers not competing with writers. JetBrains actively supports it.

Xodus also has an entity store solution which, along with ORM implemented in Kotlin, can be used as a primary database in your project.

There are plans to implement SQL language, which will allow using Xodus as the primary database for projects written in other languages.

Andrey Lomakin
  • 637
  • 3
  • 12