
TL;DR:

I'm looking for recommendations for a distributed key-value store with an average entry size of up to 50 KB, to be installed on Linux (dedicated servers).
A file-system solution would do.
I found a few solutions: Ceph, Cassandra, Riak, and a few more.

Details

I'm looking for a storage solution for one of our components. It should be a key-value store with a flat namespace.

Scenario

The read/write patterns are very simple:

Once a key-value is written, there are a few reads within the next hours.

After that, nothing touches the given key-value. We'd like to keep the data for future purposes, "Storage mode".

Other usage aspects

  • OS: Linux
  • Python client/connector
  • Total size: up to 80TB (this value also represents future needs).
  • Avg Entry Size (for a single value in a k-v pair): 10 to 50 KB, uncompressed, mostly textual data
  • Compression: either built-in or external.
  • Encryption: not needed
  • Network bandwidth: 1 Gb/s, single LAN
  • Servers: dedicated (not in the cloud)
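
For a rough sense of scale, the figures above can be turned into an entry count and a transfer time. This is only a back-of-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes, 1 KB = 10^3 bytes), treats the network figure as 1 gigabit per second, and ignores protocol overhead, replication traffic, and compression. The function and constant names are made up for illustration.

```python
# Back-of-envelope sizing based on the figures listed above.
# Assumptions: decimal units, a 1 gigabit-per-second link, no overhead.
TOTAL_BYTES = 80 * 10**12                   # 80 TB total
ENTRY_SIZES = (10 * 10**3, 50 * 10**3)      # 10 KB and 50 KB per value
LINK_BITS_PER_SEC = 10**9                   # 1 Gb/s LAN


def entry_count(total_bytes, entry_bytes):
    """Number of whole entries of entry_bytes that fit in total_bytes."""
    return total_bytes // entry_bytes


def seconds_to_transfer(num_bytes, link_bits_per_sec):
    """Ideal time to push num_bytes over the link, ignoring all overhead."""
    return num_bytes * 8 / link_bits_per_sec


for size in ENTRY_SIZES:
    count = entry_count(TOTAL_BYTES, size)
    per_entry = seconds_to_transfer(size, LINK_BITS_PER_SEC)
    print(f"{size} B entries: {count:,} keys, {per_entry * 1000:.2f} ms each on the wire")

full_copy_days = seconds_to_transfer(TOTAL_BYTES, LINK_BITS_PER_SEC) / 86400
print(f"Copying all 80 TB over one link: about {full_copy_days:.1f} days")
```

Under these assumptions the store has to handle somewhere between 1.6 and 8 billion keys, and a full copy over a single link (for example, a rebuild or an off-site backup) takes on the order of a week.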

Most important requirements

The "base" requirements are:

  • OS: Linux
  • Python client/connector OR RESTful API via HTTP
  • Can easily store up to 80TB (this value also represents future needs).
  • Max read latency: a few seconds for first reads, 30 seconds for "storage mode" (see above for explanation)
  • Built in replication (so that data is stored on more than a single node)

Nice to have

  • RESTful gateway
  • Background data backup to another store (for data recovery in case of a disaster).
  • Easy to configure

What I've found so far

  • Ceph
  • HDFS
  • HBase on top of HDFS
  • Lustre
  • GlusterFS
  • Mongo's GridFS - but can I trust Mongo's infrastructure?
  • Cassandra - not an option, since the merge process consumes double disk size
  • Riak - looks like it has the same issue as Cassandra, needs more research
  • Swift + OpenStack (actual storage can be on Amazon S3)
  • Voldemort
  • There are dozens of additional tools, but I won't list them here, since some of them have proprietary licenses and others seem immature.

I'd appreciate any recommendation on any of the tools I mentioned above (with total capacity of more than 50TB), or on a tool you think is sufficient.

Ron Klein
  • This is a [shopping question](http://blog.stackoverflow.com/2010/11/qa-is-hard-lets-go-shopping/). You've already successfully identified the products that could work for you, but anything that we could add would be subjective opinion. Here's mine: If you're looking for a key-value file store, do Ceph. If you're looking for a *filesystem* to treat as a key-value store, Gluster will do just as well, but Ceph can do that also. – Charles Feb 04 '13 at 08:55
  • @Charles, I agree, but I still think others could benefit from it. Thanks for your opinion! – Ron Klein Feb 04 '13 at 09:08
  • I use Ceph - it's been working for us so far. – Erik Aronesty Dec 04 '13 at 21:24
  • FoundationDB is the (new) kid on the block – amirouche Nov 03 '18 at 19:57
  • You may find [this](https://stackoverflow.com/a/53159654/2361497) answer interesting. – Vitaly Isaev Nov 05 '18 at 17:57

1 Answer


Just use Ceph (I mean direct librados usage). Don't use GlusterFS -- it's hangy.
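
A minimal sketch of what direct librados usage could look like from Python, assuming the `rados` binding that ships with Ceph is installed and a pool already exists. The class name `CompressedKV`, the helper `open_ceph_ioctx`, the pool name `kv-pool`, and the choice of zlib for compression are all illustrative assumptions, not something prescribed by Ceph. The wrapper only relies on the I/O context's `write_full`, `stat`, and `read` methods, so it can be tried against a stand-in object without a cluster.

```python
import zlib


class CompressedKV:
    """Flat-namespace key-value wrapper over a librados-style I/O context.

    Values are text, compressed with zlib before being stored (an
    illustrative choice; any external compressor would do).
    """

    def __init__(self, ioctx):
        self.ioctx = ioctx

    def put(self, key, text):
        # write_full replaces the whole object stored under this key.
        self.ioctx.write_full(key, zlib.compress(text.encode("utf-8")))

    def get(self, key):
        # read() defaults to a small length, so ask stat() for the real size.
        size, _mtime = self.ioctx.stat(key)
        return zlib.decompress(self.ioctx.read(key, size)).decode("utf-8")


def open_ceph_ioctx(pool_name, conffile="/etc/ceph/ceph.conf"):
    """Connect to a Ceph cluster and open an I/O context on pool_name.

    The import is done here so the wrapper above can be used without
    the rados module being installed.
    """
    import rados

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    return cluster, cluster.open_ioctx(pool_name)


# Hypothetical usage against a real cluster ("kv-pool" is a made-up name):
#   cluster, ioctx = open_ceph_ioctx("kv-pool")
#   store = CompressedKV(ioctx)
#   store.put("entry-0001", "some textual value")
#   print(store.get("entry-0001"))
#   ioctx.close()
#   cluster.shutdown()
```

Replication is configured on the pool itself, so the client code stays the same regardless of how many copies are kept.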

socketpair