TL;DR:
I'd like recommendations for a distributed key-value store, with an average entry size of up to 50 KB, to be installed in a Linux environment (dedicated servers).
A file-system solution would do.
I found a few solutions: Ceph, Cassandra, Riak, and a few more.
Details
I'm looking for a storage solution for one of our components. It should be a key-value store with a flat namespace.
Scenario
The read/write patterns are very simple:
Once a key-value pair is written, it is read a few times within the next few hours.
After that, nothing touches that pair. We'd like to keep the data for future use; I call this "storage mode".
Other usage aspects
- OS: Linux
- Python client/connector
- Total size: up to 80TB (this value also represents future needs).
- Avg Entry Size (for a single value in a k-v pair): 10 to 50 KB, uncompressed, mostly textual data
- Compression: either built-in or external.
- Encryption: not needed
- Network bandwidth: 1 Gb/s, single LAN
- Servers: dedicated (not in the cloud)
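To put rough numbers on the figures above, here is a minimal back-of-the-envelope sketch. It assumes decimal units (1 KB = 1,000 bytes), a fully usable 1 Gb/s link with no protocol overhead, and uncompressed entries; the function names are only illustrative.

```python
# Rough sizing from the figures listed above.
# Assumptions: decimal units, no protocol overhead, no compression.
TOTAL_BYTES = 80 * 10**12           # 80 TB total
ENTRY_MIN_BYTES = 10 * 10**3        # 10 KB per value
ENTRY_MAX_BYTES = 50 * 10**3        # 50 KB per value
LINK_BYTES_PER_SEC = 10**9 // 8     # 1 Gb/s is 125 MB/s


def entry_count(total_bytes: int, entry_bytes: int) -> int:
    """Number of entries of a given size that fit in the total capacity."""
    return total_bytes // entry_bytes


def max_entries_per_second(link_bytes_per_sec: int, entry_bytes: int) -> int:
    """Upper bound on entries per second that a single link can carry."""
    return link_bytes_per_sec // entry_bytes


fewest_keys = entry_count(TOTAL_BYTES, ENTRY_MAX_BYTES)
most_keys = entry_count(TOTAL_BYTES, ENTRY_MIN_BYTES)
slowest_rate = max_entries_per_second(LINK_BYTES_PER_SEC, ENTRY_MAX_BYTES)
fastest_rate = max_entries_per_second(LINK_BYTES_PER_SEC, ENTRY_MIN_BYTES)

print(f"keys at full capacity: {fewest_keys:,} to {most_keys:,}")
print(f"entries/s over one link: {slowest_rate:,} to {fastest_rate:,}")
```

So the store has to handle a flat namespace of somewhere between 1.6 and 8 billion keys at full capacity, which matters for any tool that keeps all keys or per-key metadata in memory.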
Most important requirements
The "base" requirements are:
- OS: Linux
- Python client/connector OR RESTful API via HTTP
- Can easily store up to 80TB (this value also represents future needs).
- Max read latency: a few seconds for first reads, 30 seconds for "storage mode" (see above for explanation)
- Built-in replication (so that data is stored on more than a single node)
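To make the "external compression" option concrete, here is a minimal sketch of the access pattern I have in mind. The `CompressedKV` class is hypothetical, and the dict it wraps is only a stand-in for whichever real store client ends up being used; it is not any tool's actual API.

```python
import zlib
from typing import Optional


class CompressedKV:
    """Hypothetical put/get wrapper that compresses values on the client side.

    `backend` is any dict-like object; a plain dict stands in for a real
    distributed store here.
    """

    def __init__(self, backend=None, level: int = 6):
        self._backend = {} if backend is None else backend
        self._level = level

    def put(self, key: str, text: str) -> None:
        # Values are mostly textual, so encode to UTF-8 and compress with zlib.
        self._backend[key] = zlib.compress(text.encode("utf-8"), self._level)

    def get(self, key: str) -> Optional[str]:
        blob = self._backend.get(key)
        if blob is None:
            return None
        return zlib.decompress(blob).decode("utf-8")

    def stored_size(self, key: str) -> int:
        """Size in bytes of the compressed value as held by the backend."""
        return len(self._backend[key])


store = CompressedKV()
sample = "some mostly textual payload\n" * 1000   # about 28 KB uncompressed
store.put("entry-1", sample)
roundtrip = store.get("entry-1")
print(len(sample), "bytes raw,", store.stored_size("entry-1"), "bytes stored")
```

The sample text here is artificially repetitive, so its compression ratio says nothing about real data; the point is only that compression can live entirely in the client if the store itself doesn't provide it.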
Nice to have
- RESTful gateway
- Background data backup to another store (for data recovery in case of a disaster).
- Easy to configure
What I've found so far
- Ceph
- HDFS
- HBase on top of HDFS
- Lustre
- GlusterFS
- Mongo's GridFS - but can I trust Mongo's infrastructure?
- Cassandra - not an option, since its merge (compaction) process can temporarily need up to double the disk space
- Riak - looks like it has the same issue as Cassandra, needs more research
- Swift + OpenStack (actual storage can be on Amazon S3)
- Voldemort
- There are dozens of additional tools, but I won't list them here, since some of them have proprietary licenses and others seem immature.
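For comparing these options, the raw disk footprint can be estimated roughly as below. This is only a sketch: the replication factor of 3 is an assumed value, not something I have settled on, and the 2x headroom is my own "double disk size" concern about Cassandra taken as a worst case, not a measured figure.

```python
# Rough raw-disk estimate. Both factors below are assumptions.
LOGICAL_TB = 80            # total logical data, from the requirements
REPLICATION_FACTOR = 3     # assumed number of copies kept by the store
MERGE_HEADROOM = 2         # assumed worst case: merges need double the space


def raw_capacity_tb(logical_tb: int, replication_factor: int) -> int:
    """Disk needed to hold every replica of the logical data."""
    return logical_tb * replication_factor


def peak_capacity_tb(raw_tb: int, headroom: int) -> int:
    """Disk needed if the store must temporarily hold `headroom` times the data."""
    return raw_tb * headroom


raw_tb = raw_capacity_tb(LOGICAL_TB, REPLICATION_FACTOR)
peak_tb = peak_capacity_tb(raw_tb, MERGE_HEADROOM)

print(f"raw disk with replication: {raw_tb} TB")
print(f"peak disk with merge headroom: {peak_tb} TB")
```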
I'd appreciate any recommendation on any of the tools I mentioned above (at a total capacity of more than 50TB), or on another tool you think is suitable.