9

I need to store records in persistent storage and retrieve them on demand. The requirements are as follows:

  1. Extremely fast retrieval and insertion
  2. Each record will have a unique key. This key will be used to retrieve the record
  3. The data stored should be persistent i.e. should be available upon JVM restart
  4. A separate process would move stale records to RDBMS once a day

What do you guys think? I cannot use a standard database because of latency issues. In-memory databases like HSQLDB/H2 have performance constraints. Moreover, the records are simple string objects and do not really call for SQL. I am thinking of some kind of flat-file-based solution. Any ideas? Any open source project? I am sure there must be someone who has solved this problem before.

AAK
  • 1
    What do you mean by "Extremely fast"? – David Rabinowitz Oct 15 '09 at 14:16
  • Sub-millisecond latency to store and retrieve – AAK Oct 15 '09 at 14:42
  • 2
    what's your ratio of writes to reads? when reading, what's the pattern of access (random, clumpy, ...)? what's the nature of the unique key for each record (doesn't matter, uuid, timestamp)? – Ron Oct 15 '09 at 15:12
  • 2
    You are going to struggle to get anything sub-millisecond - I know some guys working with hard-core trading systems and they are proud of their sub-5ms end-to-end latency on trades. The requirements seem very vague to me and with way too little detail. – SteveD Oct 15 '09 at 15:21

15 Answers

7

There are a lot of diverse tools and methods, but I think none of them shines on all of these requirements.

For low latency, you can only rely on in-memory data access - disks are physically too slow (and SSDs too). If the data does not fit in the memory of a single machine, we have to distribute it across enough nodes to provide the required total memory.

For persistence, we have to write our data to disk after all. With an optimal organization this can be done as a background activity that does not affect latency. However, for reliability (failover, HA, or whatever), disk operations cannot be totally independent of the access methods: we have to wait for the disks when modifying data to make sure our operation will not disappear. Concurrency also adds complexity and latency.

The data model is not a constraint here: most of these methods support access based on a unique key.

We have to decide:

  • whether the data fits in the memory of one machine, or we have to find a distributed solution,
  • whether concurrency is an issue, or there are no parallel operations,
  • whether reliability is strict and we cannot lose modifications, or we can live with the fact that an unplanned crash would result in data loss.

Solutions might be:

  • Self-implemented data structures using the standard Java library, files, etc. This may not be the best solution, because reliability and low latency require clever implementations and lots of testing.
  • Traditional RDBMSs have a flexible data model and durable, atomic, isolated operations, caching, etc. They actually know too much, and are mostly hard to distribute. That is why they are too slow, if you cannot turn off the unwanted features, which is usually the case.
  • NoSQL and key-value stores are good alternatives. These terms are quite vague and cover lots of tools. Examples are
    • BerkeleyDB or Kyoto Cabinet as one-machine persistent key-value stores (using B-trees): can be used if the data set is small enough to fit in the memory of one machine (see the sketch at the end of this answer),
    • Project Voldemort as a distributed key-value store: uses BerkeleyDB Java Edition inside, simple and distributed,
    • ScalienDB as a distributed key-value store: reliable, and not too slow for writes either,
    • MemcacheDB, Redis, and other caching databases with persistence,
    • popular NoSQL systems like Cassandra, CouchDB, HBase, etc.: used mainly for big data.

A list of NoSQL tools can be found, e.g., here.

Voldemort's performance tests report sub-millisecond response times, and these can be achieved quite easily; however, we have to be careful with the hardware too (like the network properties mentioned above).
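
To illustrate the single-machine option, here is a minimal sketch using the BerkeleyDB Java Edition API; the storage path and keys are made up, and error handling is omitted:

    import com.sleepycat.je.*;
    import java.io.File;
    import java.nio.charset.StandardCharsets;

    public class BdbJeSketch {
        public static void main(String[] args) {
            // The environment home directory must exist before opening.
            File home = new File("/tmp/record-store");
            home.mkdirs();

            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            Environment env = new Environment(home, envConfig);

            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            Database db = env.openDatabase(null, "records", dbConfig);

            // Keys and values are plain byte arrays.
            DatabaseEntry key = new DatabaseEntry("key-1".getBytes(StandardCharsets.UTF_8));
            DatabaseEntry value = new DatabaseEntry("payload".getBytes(StandardCharsets.UTF_8));
            db.put(null, key, value);

            // Retrieve by key.
            DatabaseEntry result = new DatabaseEntry();
            if (db.get(null, key, result, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
                System.out.println(new String(result.getData(), StandardCharsets.UTF_8));
            }

            db.close();
            env.close();
        }
    }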

csaba
5

Have a look at LinkedIn's Voldemort.

fvu
4

If all the data fits in memory, MySQL can run in memory instead of from disk (MySQL Cluster, Hybrid Storage). It can then handle persisting itself to disk for you.

Dean J
4

What about something like CouchDB?

Mark
3

I would use a BlockingQueue for that. It is simple, and built into Java. I do something similar using real-time data from the Chicago Mercantile Exchange. The data is sent to one place for real-time use... and to another place (via TCP), using a BlockingQueue (producer/consumer) to persist the data to a database (Oracle, H2). The consumer uses a time-delayed commit to avoid disk-sync (fsync) issues in the database. (H2-type databases use asynchronous commit by default and avoid that issue.) I log the persisting in the consumer to keep track of the queue size, to be sure it is able to keep up with the producer. It works pretty well for me.
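
A minimal sketch of that producer/consumer hand-off (class and method names are illustrative; the JDBC batch work is elided):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class PersistenceQueue {
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(100_000);

        // Producer side: called on the real-time path.
        public void submit(String record) throws InterruptedException {
            queue.put(record); // blocks only if the consumer has fallen far behind
        }

        // Consumer side: runs on its own thread and batches records,
        // so the database sees one commit per batch instead of one per record.
        public void consumeLoop() throws InterruptedException {
            List<String> batch = new ArrayList<>();
            while (!Thread.currentThread().isInterrupted()) {
                String first = queue.poll(100, TimeUnit.MILLISECONDS);
                if (first == null) continue;   // nothing arrived, poll again
                batch.add(first);
                queue.drainTo(batch, 999);     // grab whatever else is queued
                persistBatch(batch);           // e.g. JDBC addBatch() + one commit
                batch.clear();
            }
        }

        private void persistBatch(List<String> batch) {
            // JDBC batch insert into Oracle/H2 would go here (elided).
        }
    }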

Sumit Singh
2

MySQL with shards may be a good idea. However, it depends on the data volume, transactions per second, and latency you need.

In-memory databases are also a good idea. In fact, MySQL provides memory-based tables as well.

Shantanu Kumar
  • yeah...in-memory databases are good...but my previous experience with HSQLDB is not so great...in fact we had determined HSQLDB was taking substantial time in our processing...Not sure about MySQL though – AAK Oct 15 '09 at 14:30
2

Would a tuple space / JavaSpaces work? Also check out other enterprise data fabrics like Oracle Coherence and GemStone.

Kevin
2

MapDB provides highly performant HashMaps/TreeMaps that are persisted to disk. It's a single library that you can embed in your Java program.
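
A small usage sketch, assuming the MapDB 3.x builder API (file and map names are illustrative):

    import org.mapdb.DB;
    import org.mapdb.DBMaker;
    import org.mapdb.Serializer;
    import java.util.concurrent.ConcurrentMap;

    public class MapDbSketch {
        public static void main(String[] args) {
            // File-backed DB; transactionEnable() adds a write-ahead log for
            // crash protection, at some write-latency cost.
            DB db = DBMaker.fileDB("records.db").transactionEnable().make();
            ConcurrentMap<String, String> map = db
                    .hashMap("records", Serializer.STRING, Serializer.STRING)
                    .createOrOpen();

            map.put("key-1", "payload");   // persisted to disk
            db.commit();                   // flush the write-ahead log
            String v = map.get("key-1");   // in-memory-speed lookup
            db.close();
        }
    }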

Andrejs
1

Have you actually proved that using an out-of-process SQL database like MySQL or SQL Server is too slow, or is this an assumption?

You could use a SQL database approach in conjunction with an in-memory cache to ensure that retrievals do not hit the database at all. Despite the fact that the records are plaintext, I would still advise using SQL over a flat-file solution (e.g. using a text column in your table schema), as the RDBMS will perform optimisations that a file system cannot (e.g. caching recently accessed pages).
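
For example, a minimal read-through cache in front of the database; the loader below is a stand-in for a JDBC lookup by key, not a specific API:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    public class ReadThroughCache {
        private final Map<String, String> cache = new ConcurrentHashMap<>();
        private final Function<String, String> loader;

        public ReadThroughCache(Function<String, String> loader) {
            this.loader = loader; // e.g. a JDBC SELECT by key
        }

        public String get(String key) {
            // computeIfAbsent hits the database only on a cache miss
            return cache.computeIfAbsent(key, loader);
        }

        public void put(String key, String value) {
            cache.put(key, value);
            // write through to the database here as well, so a restart
            // can repopulate the cache from the authoritative store
        }
    }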

However, without more information about your access patterns, expected throughput, etc. I can't provide much more in the way of suggestions.

Adamski
  • Yes. Our legacy system uses an RDBMS and it takes a few milliseconds to retrieve data. This is a high-frequency application; the speed required is sub-millisecond for whole-message processing, where storage and retrieval are just one part of the message processing – AAK Oct 15 '09 at 14:26
  • More importantly, what are your access patterns? Is the data sequential (e.g. time series)? Is the data written once and read many times, or can it potentially be updated? There are bespoke solutions for this (e.g. KDB) but it largely depends on your use case. – Adamski Oct 15 '09 at 16:10
1

How much does it matter if you lose a record or two? Where are they coming from? Do you have a transactional relationship with the source?

If you have serious reliability requirements then I think you may need to be prepared to pay some DB overhead.

Perhaps you could separate the persistence problem from the in-memory problem. Use a pub-sub approach: one subscriber looks after the in-memory copy, the other persists the data ready for a subsequent startup.

Distributed caching products such as WebSphere eXtreme Scale (no Java EE dependency) might be relevant if you can buy rather than build.

djna
  • The reliability requirements are pretty high. I was also inclined towards some caching solution. EHCache? – AAK Oct 15 '09 at 14:28
1

How bad would it be if you lost a couple of entries in the event of a crash?

If that isn't too bad, the following approach might work for you:

Create a flat file for each entry, with the file name equal to the id. Possibly use one file for a smallish number of consecutive entries.

Make sure your controller has a good cache and/or use one of the existing caches implemented in Java.

Talk to a file system expert about how to make this really fast.

It is simple and it might be fast. Of course you lose transactions, including the ACID guarantees.
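
A bare-bones sketch of the file-per-entry layout (the directory name is illustrative; caching, locking, and fsync handling are omitted):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class FilePerEntryStore {
        private final Path dir;

        public FilePerEntryStore(Path dir) throws IOException {
            this.dir = Files.createDirectories(dir); // e.g. Paths.get("/tmp/records")
        }

        // One file per record; the file name is the record's unique key.
        public void put(String id, String value) throws IOException {
            Files.write(dir.resolve(id), value.getBytes(StandardCharsets.UTF_8));
        }

        public String get(String id) throws IOException {
            return new String(Files.readAllBytes(dir.resolve(id)), StandardCharsets.UTF_8);
        }
    }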

Jens Schauder
  • The reliability requirements are pretty high. We cannot afford to lose any data upon crash... – AAK Oct 15 '09 at 14:41
1

If you are looking for a simple key-value store and don't need complex sql querying, Berkeley DB might be worth a look.

Another alternative is Tokyo Cabinet, a modern DBM implementation.

Peter Hoffmann
1

Sub-millisecond r/w means you cannot depend on disk, and you have to be careful about network latency. Just forget about standard SQL-based solutions, main-memory or not. In a millisecond, you cannot get more than about 100 KB over a gigabit network. Ask a telecom engineer; they are used to solving these kinds of problems.

Stephan Eggermont
0

Chronicle Map is a ConcurrentMap implementation that stores keys and values off-heap, in a memory-mapped file. So you have persistence on JVM restart.

ChronicleMap.get() is consistently faster than 1 µs, sometimes as fast as 100 ns per operation. It's the fastest solution in its class.
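
A minimal sketch, assuming the Chronicle Map builder API (the file name and sizing hints are made up):

    import net.openhft.chronicle.map.ChronicleMap;
    import java.io.File;
    import java.io.IOException;

    public class ChronicleSketch {
        public static void main(String[] args) throws IOException {
            // The builder wants average key/value sizes and a max entry count
            // up front so it can size the memory-mapped file.
            try (ChronicleMap<String, String> map = ChronicleMap
                    .of(String.class, String.class)
                    .name("records")
                    .averageKey("key-12345")
                    .averageValue("a typical record payload")
                    .entries(1_000_000)
                    .createPersistedTo(new File("records.dat"))) {

                map.put("key-1", "payload");          // written to the mapped file
                System.out.println(map.get("key-1")); // survives a JVM restart
            }
        }
    }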

leventov
-1

Will all the records and keys you need fit in memory at once? If so, you could just use a HashMap<String,String>, since it's Serializable.

wdebeaum
  • -1 from me. You'll need to manually serialize the entire HashMap on each insert, which is obviously very slow. – Tim Frey Oct 15 '09 at 14:16
  • yep...but how about real time data persistence? I need to persist data as it comes so that if the JVM crashes I do not lose the data... – AAK Oct 15 '09 at 14:29
  • @AAK: you could just serialize and store each change. Then you don't have an immediately usable persistent store, but you have a log from which you can rebuild the storage in the event of an error. – Joachim Sauer Oct 21 '09 at 12:40
  • (btw, I'm not saying that this is the ideal solution, I'm just saying that it could be made to work) – Joachim Sauer Oct 21 '09 at 12:40
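
To make the last comment concrete, a minimal sketch of that change-log idea, where every put is appended to a log that is replayed on startup (the class name and tab-separated format are made up; flushing on every put trades latency for durability, and values containing tabs or newlines would need escaping):

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.HashMap;
    import java.util.Map;

    public class LoggedMap {
        private final Map<String, String> map = new HashMap<>();
        private final BufferedWriter log;

        public LoggedMap(Path logFile) throws IOException {
            if (Files.exists(logFile)) {           // rebuild state after a restart
                for (String line : Files.readAllLines(logFile, StandardCharsets.UTF_8)) {
                    int sep = line.indexOf('\t');
                    map.put(line.substring(0, sep), line.substring(sep + 1));
                }
            }
            log = Files.newBufferedWriter(logFile, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }

        public void put(String key, String value) throws IOException {
            log.write(key + "\t" + value);         // append the change first
            log.newLine();
            log.flush();                           // durability vs. latency trade-off
            map.put(key, value);                   // then update the in-memory view
        }

        public String get(String key) {
            return map.get(key);
        }
    }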