28

I need a disk-backed Map structure to use in a Java app. It must meet the following criteria:

  1. Capable of storing millions of records (even billions)
  2. Fast lookup - the majority of operations on the Map will simply be to check whether a key already exists. This and criterion 1 above are the most important. There should be an effective in-memory caching mechanism for frequently used keys.
  3. Persistent, but does not need to be transactional and can live with some failure, i.e. happy to sync with disk periodically.
  4. Capable of storing simple primitive types - but I don't need to store serialised objects.
  5. It does not need to be distributed, i.e. will run all on one machine.
  6. Simple to set up & free to use.
  7. No relational queries required

Record keys will be strings or longs. As described above, reads will be much more frequent than writes, and the majority of reads will simply check whether a key exists (i.e. they will not need to read the key's associated data). Each record will be updated once only, and records are never deleted.
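The access pattern described above (mostly existence checks, records never deleted) can be sketched as a small LRU cache of known keys placed in front of whichever disk-backed store is chosen. This is only an illustration: `KeyStore` and `CachedKeyStore` are hypothetical names, not part of BDB JE or any other library, and the cache only remembers keys that exist, which is safe here because records are never removed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical minimal interface for the disk-backed store (not a real library API). */
interface KeyStore {
    boolean contains(String key);
    void put(String key, long value);
}

/** Remembers recently seen keys that exist, so repeated checks skip the disk. */
class CachedKeyStore implements KeyStore {
    private final KeyStore backing;
    private final Map<String, Boolean> known;

    CachedKeyStore(KeyStore backing, final int capacity) {
        this.backing = backing;
        // accessOrder = true keeps entries ordered from least to most recently used
        this.known = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity; // evict the least recently used key
            }
        };
    }

    @Override
    public synchronized boolean contains(String key) {
        if (known.get(key) != null) { // get() also refreshes the key's recency
            return true;
        }
        boolean exists = backing.contains(key);
        if (exists) {
            known.put(key, Boolean.TRUE);
        }
        return exists;
    }

    @Override
    public synchronized void put(String key, long value) {
        backing.put(key, value);
        known.put(key, Boolean.TRUE);
    }
}
```

Misses are deliberately not cached: a key reported absent may be inserted later, so only positive answers are kept.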

I currently use BDB JE but am seeking other options.


Update

I have since improved query performance on my existing BDB setup by reducing the dependency on secondary keys. Some queries required a join on two secondary keys; by combining them into a composite key I removed a level of indirection in the lookup, which speeds things up nicely.
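One way to build such a composite key, not necessarily how it was done here, is to pack both fields into a single fixed-width byte array, so a single lookup replaces the join. The sketch below assumes both fields are longs; `CompositeKey` is a hypothetical helper, not a BDB class.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper that packs two long fields into one 16-byte key. */
final class CompositeKey {
    private CompositeKey() {}

    /** Big-endian layout: the first field occupies bytes 0-7, the second bytes 8-15. */
    static byte[] encode(long first, long second) {
        return ByteBuffer.allocate(16).putLong(first).putLong(second).array();
    }

    static long first(byte[] key) {
        return ByteBuffer.wrap(key).getLong(0);
    }

    static long second(byte[] key) {
        return ByteBuffer.wrap(key).getLong(8);
    }
}
```

With this layout, keys that share the same first field sit next to each other when the store compares keys byte by byte, at least for non-negative values.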

Joel
  • One option I am considering is changing the way I use my existing BDB implementation. Currently I have one large database for all my records. However, I should be able to partition the data up into sets and have one database per set - if I know that at any point in time I will only need access to certain sets then I can keep closed those sets I'm not using, which should help bdb manage data more efficiently for me. – Joel Oct 08 '09 at 12:44
  • i've used bdb je. for your criteria, it is a great fit. however, i was really disappointed with the fragility of it, and would not recommend it for production use. any hiccup in the java process caused the bdb subsystem to require a restart, blech! – james Oct 08 '09 at 15:24
  • I'm not sure what you mean by "the fragility" of BDB JE. BDB JE is scalable to Terabytes of data and I use it in production systems all the time. It's a wonderful piece of tech. – jasonmp85 May 30 '10 at 23:56

9 Answers

20

JDBM3 does exactly what you are looking for. It is a library of disk-backed maps with a really simple API and high performance.

UPDATE

This project has now evolved into MapDB http://www.mapdb.org

Andrejs
6

You can try Chronicle Map (http://openhft.net/products/chronicle-map/), a high-performance, off-heap, in-memory key-value store that can also be persisted. It works like a standard Java map.

Harvinder Singh
  • 1
    While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Cyclonecode Nov 24 '14 at 22:37
  • 2
    @krister - I think this is a case where a less than ideal question generated an answer that violated SO's policy (the answer did a good job of answering the question). In this case, I'm inclined to move against the question. – jww Nov 25 '14 at 00:31
  • Replication in a distributed caching topology is a paid feature – Amrish Pandey May 04 '21 at 02:26
6

You may want to look into OrientDB.

Juha Syrjälä
3

I'd likely use a local database, such as BDB JE or HSQLDB. May I ask what is wrong with this approach? You must have some reason to be looking for alternatives.

In response to comments: as the problem is performance, and I guess you are already using JDBC to handle this, it might be worth trying HSQLDB and reading the chapter on Memory and Disk Use.

MikeFHay
Michael Lloyd Lee mlk
  • 1
    +1 agree. I would use a regular DB and write a nice API for the requirements so that the backend can be switched easily. – flybywire Oct 08 '09 at 10:40
  • Once Bdb reaches the limits of what can be cached in memory i'm finding that it slows down unacceptably. This generally happens after about 1mm inserts. – Joel Oct 08 '09 at 10:46
  • How about HSQLDB? I'm going to guess they both JDBC so you should be able to slot it in without modifying much of your existing code. Would be worth reading: http://hsqldb.org/doc/2.0/guide/deployment-chapt.html#deployment_mem_disk-sect – Michael Lloyd Lee mlk Oct 08 '09 at 11:24
  • 2
    BDBs slow down once you hit the point that you're thrashing your cache. BDBs essentially have a BTree in memory which tries to answer a request. If the request cannot be answered, the BDB pages in more data from disk. Once your working set is larger than your cache, you'll find trouble. There are JMX methods for monitoring the cache hit misses and cache size: use them to debug your application and if necessary increase the heap and give BDB more cache. – jasonmp85 May 30 '10 at 23:58
  • 4
    Also HSQLDB is **not** an acceptable solution. While it can store a lot of data on disk, it does **not** stream that data from disk when doing reads. It reads the entire `ResultSet` into memory rather than paging it in as you iterate through it. If you ever need to walk over a large portion of a table this will blow out your memory. BDBs handle this just fine. I also believe the H2 database (http://www.h2database.com/html/main.html) claims to solve this, though I've never used it. – jasonmp85 May 31 '10 at 00:00
  • @jasonmp85 - this is exactly what i've found - once the BDB BTree no longer fits in memory you're in trouble. – Joel Jun 25 '10 at 14:58
3

As of today I would use either MapDB (file-backed, sync or async) or Hazelcast. With the latter you will have to implement your own persistence, e.g. backed by an RDBMS, by implementing a Java interface. OpenHFT Chronicle might be another option. I am not sure how persistence works there since I have never used it, but they claim to have it. OpenHFT is completely off-heap and allows partial updates of objects (of primitives) without (de-)serialization, which might be a performance benefit.

NOTE: If you need your map to be disk-based because of memory issues, the easiest option is MapDB. Hazelcast could be used as a cache (distributed or not), which allows you to evict elements from the heap by age or size. OpenHFT is off-heap and could be considered if you only need persistence across JVM restarts.

KIC
1

SQLite does this. I wrote a wrapper for using it from Java: http://zentus.com/sqlitejdbc

As I mentioned in a comment, I have successfully used SQLite with gigabytes of data and tables of hundreds of millions of rows. If you think out the indexing properly, it's very fast.

The only pain is the JDBC interface. Compared to a simple HashMap, it is clunky. I often end up writing a JDBC wrapper for the specific project, which can add up to a lot of boilerplate code.
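As a rough idea of what such a wrapper might look like for the question's use case (existence checks, write-once records), here is a minimal sketch. It is not the wrapper mentioned above: `JdbcKeyStore`, the `kv` table and its column names are made up for illustration, and it assumes the caller supplies an open `java.sql.Connection` to a SQLite database (for example one obtained from a `jdbc:sqlite:` URL).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical Map-like wrapper over a JDBC connection; table and columns are illustrative. */
final class JdbcKeyStore implements AutoCloseable {
    static final String CREATE_SQL =
        "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v INTEGER)";
    static final String EXISTS_SQL = "SELECT 1 FROM kv WHERE k = ?";
    // INSERT OR IGNORE is SQLite syntax; it fits records that are written only once.
    static final String INSERT_SQL = "INSERT OR IGNORE INTO kv (k, v) VALUES (?, ?)";

    private final Connection conn;

    JdbcKeyStore(Connection conn) throws SQLException {
        this.conn = conn;
        try (Statement st = conn.createStatement()) {
            st.executeUpdate(CREATE_SQL);
        }
    }

    /** Looks the key up through the primary-key index; the value is never fetched. */
    boolean containsKey(String key) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(EXISTS_SQL)) {
            ps.setString(1, key);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }

    /** Returns true if a row was inserted, false if the key was already present. */
    boolean putIfAbsent(String key, long value) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, key);
            ps.setLong(2, value);
            return ps.executeUpdate() == 1;
        }
    }

    @Override
    public void close() throws SQLException {
        conn.close();
    }
}
```

Preparing the statements once and reusing them would likely be faster for bulk work; they are prepared per call here only to keep the sketch short.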

Sam Dufel
David Crawshaw
  • 1
    I have successfully used SQLite with gigabytes of data and tables of hundreds of millions of rows. If you think out the indexing properly, it's very fast. – David Crawshaw Oct 08 '09 at 22:44
1

I've found Tokyo Cabinet to be a simple persistent hash map that is fast to set up and use.

This abbreviated example, taken from the docs, shows how simple it is to save and retrieve data from a persistent map:

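    // requires the Tokyo Cabinet Java binding (import tokyocabinet.HDB;)
    // create the object
    HDB hdb = new HDB();
    // open the database for writing, creating the file if needed
    hdb.open("casket.tch", HDB.OWRITER | HDB.OCREAT);
    // add an item
    hdb.put("foo", "hop");
    // retrieve the item
    String value = hdb.get("foo");
    hdb.close();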
Joel
0

I think Hibernate Shards may easily fulfill all your requirements.

Ray Hulha
Boris Pavlović
0

JBoss (tree) Cache is a great option. You can use it standalone from JBoss. Very robust, performant, and flexible.

james