
Yet another question about which NoSQL database to choose. However, I haven't yet found anyone asking about this particular use case: message storing.

I have built an Erlang chat server, and I'm already using MySQL for storing friend lists and other "JOIN needed" information.

I would like to store messages (that a user has not received because they were offline) and retrieve them later.

I have made a pre-selection of NoSQL databases. I can't use something like MongoDB because of its RAM-oriented design, and because it doesn't cluster as well as the others. I have narrowed my list down to 3 choices, I guess:

  • Hbase
  • Riak
  • Cassandra

I know that their models are quite different: one uses key/value, another uses SuperColumns, and so on.

Until now I have had a preference for Riak because of its stable Erlang client library.

I know that I can use Cassandra with Thrift, but it doesn't seem very stable with Erlang (I haven't heard good feedback about it).

I don't really know anything about HBase right now; I just know it exists and is modeled on Google's BigTable, whereas Cassandra and Riak draw on Amazon's Dynamo.

So here's what I need to do:

  • Store from 1 to X messages per registered user.
  • Get the number of stored messages per user.
  • Retrieve all messages from a user at once.
  • Delete all messages from a user at once.
  • Delete all messages that are older than X months.
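To make that concrete, here is a rough sketch (in Python, purely as pseudocode; the class and method names are mine, not from any database client) of the operations I need the store to support:

```python
import time

class OfflineMessageStore:
    """Sketch of the required operations, backed here by a plain dict."""

    def __init__(self):
        # user_key -> list of (timestamp, message) pairs
        self._logs = {}

    def store(self, user_key, message, ts=None):
        """Store from 1 to X messages per registered user."""
        ts = ts if ts is not None else time.time()
        self._logs.setdefault(user_key, []).append((ts, message))

    def count(self, user_key):
        """Get the number of stored messages per user."""
        return len(self._logs.get(user_key, []))

    def fetch_all(self, user_key):
        """Retrieve all messages from a user at once."""
        return [m for _, m in self._logs.get(user_key, [])]

    def delete_all(self, user_key):
        """Delete all messages from a user at once."""
        self._logs.pop(user_key, None)

    def delete_older_than(self, user_key, cutoff_ts):
        """Delete all messages older than a cutoff timestamp."""
        kept = [(t, m) for t, m in self._logs.get(user_key, []) if t >= cutoff_ts]
        self._logs[user_key] = kept
```

Whatever database I pick needs to support these five operations without too much contortion.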

Right now I'm really new to these NoSQL databases; I have always been a MySQL aficionado. That's why I'm asking this question: as a newbie, could someone with more experience help me choose the one that is better and would let me do everything I want without too much hassle?

Thanks !

TheSquad
  • @BrianRoach: They do not seem to think so on this question: http://stackoverflow.com/questions/2892729/mongodb-vs-cassandra (this is the same kind of question). – TheSquad Apr 23 '12 at 19:41
  • 1
    the fact that one question wasn't downvoted and closed as it should have been doesn't affect the fact that ... it's not appropriate as per the FAQ and meta. In addition, that was 2 years ago - things have evolved since then with the addition of the other sites. – Brian Roach Apr 23 '12 at 20:19

3 Answers


I can't speak for Cassandra or Hbase, but let me address the Riak part.

Yes, Riak would be appropriate for your scenario (and I've seen several companies and social networks use it for a similar purpose).

To implement this, you would need the plain Riak Key/Value operations, plus some sort of indexing engine. Your options are (in rough order of preference):

  1. CRDT Sets. If your 1-N collection size is reasonably small (say, fewer than 50 messages per user), you can store the keys of the child collection in a CRDT Set Data Type.

  2. Riak Search. If your collection size is large, and especially if you need to search your objects on arbitrary fields, you can use Riak Search. It spins up Apache Solr in the background, and indexes your objects according to a schema you define. It has pretty awesome searching, aggregation and statistics, geospatial capabilities, etc.

  3. Secondary Indexes. You can run Riak on top of an eLevelDB storage back end, and enable Secondary Index (2i) functionality.

Run a few performance tests, to pick the fastest approach.

As far as schema, I would recommend using two buckets (for the setup you describe): a User bucket, and a Message bucket.

Index the message bucket. (Either by associating a Search index with it, or by storing a user_key via 2i). This lets you do all of the required operations (and the message log does not have to fit into memory):

  • Store from 1 to X messages per registered user - Once you create a User object and get a user key, storing an arbitrary amount of messages per user is easy, they would be straight up writes to the Message bucket, each message storing the appropriate user_key as a secondary index.
  • Get the number of stored messages per user - No problem. Get the list of message keys belonging to a user (via a search query, by retrieving the Set object where you're keeping the keys, or via a 2i query on user_key). This lets you get the count on the client side.
  • retrieve all messages from a user at once - See previous item. Get the list of keys of all messages belonging to the user (via Search, Sets or 2i), and then fetch the actual messages for those keys by multi-fetching the values for each key (all the official Riak clients have a multiFetch capability, client-side).
  • delete all messages from a user at once - Very similar. Get list of message keys for the user, issue Deletes to them on the client side.
  • delete all messages that are older than X months - You can add an index on Date. Then, retrieve all message keys older than X months (via Search or 2i), and issue client-side Deletes for them.
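The operations above can be sketched as follows. This is a minimal, self-contained Python model of the two-bucket layout; plain dicts stand in for Riak's K/V operations, and the per-user key set stands in for a CRDT Set (or the result of a 2i/Search query on user_key). All names here are illustrative, not the actual Riak client API:

```python
import time
import uuid

# In-memory stand-ins for the two buckets described above.
message_bucket = {}   # message_key -> {"user_key": ..., "ts": ..., "body": ...}
user_msg_keys = {}    # user_key -> set of message keys (CRDT Set / 2i stand-in)

def store_message(user_key, body, ts=None):
    # A straight-up write to the Message bucket, recording user_key
    # the way a secondary index (or Set membership) would.
    msg_key = str(uuid.uuid4())
    message_bucket[msg_key] = {"user_key": user_key,
                               "ts": ts if ts is not None else time.time(),
                               "body": body}
    user_msg_keys.setdefault(user_key, set()).add(msg_key)
    return msg_key

def message_count(user_key):
    # Get the list of message keys belonging to the user; count client-side.
    return len(user_msg_keys.get(user_key, set()))

def fetch_all(user_key):
    # Key listing followed by a multi-fetch of the values.
    return [message_bucket[k] for k in user_msg_keys.get(user_key, set())]

def delete_all(user_key):
    # Get the key list, then issue client-side deletes.
    for k in user_msg_keys.pop(user_key, set()):
        message_bucket.pop(k, None)

def delete_older_than(cutoff_ts):
    # Index on date -> retrieve old keys -> client-side deletes.
    old = [k for k, v in message_bucket.items() if v["ts"] < cutoff_ts]
    for k in old:
        user_msg_keys.get(message_bucket[k]["user_key"], set()).discard(k)
        del message_bucket[k]
```

In real Riak the dict lookups become GET/PUT/DELETE operations and the key-set lookups become Set fetches or index queries, but the client-side flow is the same.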
Dmitri Zagidulin
  • Funny things in life... 3 years after I posted this question, I'm starting another project and had some questions I needed answered. Odds are you answered them! So here, 3 years later, an accepted answer and a +1 for the foresight ;-) – TheSquad Nov 02 '15 at 14:16
  • I edited the answer to account for a couple of new Riak features that have come along since then -- specifically, Search and Data Types. – Dmitri Zagidulin Nov 03 '15 at 15:38
  • Thanks for editing the answer to today's features. Yeah I was going to check out Riak Search. Solr is pretty awesome when you know how to use it. – TheSquad Nov 03 '15 at 16:34

I can't speak to Riak at all, but I'd question your choice to discard Mongo. It's quite good as long as you leave journaling turned off and don't completely starve it for RAM.

I know quite a lot about HBase, and it sounds like it would meet your needs easily. Might be overkill depending on how many users you have. It trivially supports things like storing many messages per user, and has functionality for automatic expiration of writes. Depending on how you architect your schema it may or may not be atomic, but that shouldn't matter for your use case.

The downsides are that there is a lot of overhead to set it up correctly. You need to know Hadoop, get HDFS running, make sure your namenode is reliable, etc. before standing up HBase.

Chris Shain
  • 1
  • I guess that MongoDB would be a good choice also, but I really would like a model based on Dynamo (no single point of failure). AFAIK MongoDB is not based on that, but I might be wrong, am I? What's your downside point about Cassandra? – TheSquad Apr 23 '12 at 19:05
  • My mind is not set, per se, on discarding MongoDB, but right now I haven't really been convinced it is the best solution for a clustered DB... it seems the 3 I have chosen are strongest on this particular point, don't you think? – TheSquad Apr 23 '12 at 19:12
  • When sharded, and with each shard replicated, Mongo has no SPOF. HBase does: the HDFS NameNode. I don't know enough about Cassandra to say much, other than that it has no SPOF and is very similar in capability to HBase. – Chris Shain Apr 23 '12 at 19:31

I'd recommend using a distributed key/value store like Riak or Couchbase and keeping the whole message log for each user serialized (as binary Erlang terms or JSON/BSON) as one value.

So your use cases would look like this:

  • Store from 1 to X messages per registered user - when the user comes online, spawn a stateful gen_server which, on startup, fetches the whole message log from storage and deserializes it. It receives new messages and appends them to its copy of the log; at the end of the session it serializes the changed log, writes it back to storage, and terminates.
  • Get the number of stored messages per user - get the log out, deserialize, and count; or store the count alongside in a separate k/v pair.
  • Retrieve all messages from a user at once - just pull the value from storage.
  • Delete all messages from a user at once - just delete the value from storage.
  • Delete all messages that are older than X months - get, filter, put back.

The obvious limitation: the message log has to fit in memory.
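The approach above can be sketched like this (Python standing in for the Erlang gen_server; the dict `kv` stands in for Riak/Couchbase, and the log is serialized with JSON; all names are illustrative):

```python
import json
import time

kv = {}  # stand-in for the distributed K/V store: user_key -> serialized log

def load_log(user_key):
    # On session start: fetch and deserialize the whole log.
    raw = kv.get(user_key)
    return json.loads(raw) if raw is not None else []

def save_log(user_key, log):
    # On session end: serialize the changed log and write it back.
    kv[user_key] = json.dumps(log)

def append_message(log, body, ts=None):
    # New messages are appended to the in-process copy of the log.
    log.append({"ts": ts if ts is not None else time.time(), "body": body})

def prune_older_than(log, cutoff_ts):
    # "Get, filter, put back" - this is the filter step.
    return [m for m in log if m["ts"] >= cutoff_ts]
```

A session is then load_log -> append_message (repeatedly) -> save_log, with prune_older_than run before the save when expiring old messages.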

If you decide to store each message individually, the distributed database will have to sort them after retrieval if you want them in time order, so this approach hardly helps with larger-than-memory datasets anyway. If that is a requirement, you will end up with a trickier scheme regardless.

Ivan Blinkov
  • Unfortunately, there is a good chance the message log will not fit in memory... This is why I'm probably going with Cassandra: its column-oriented model looks promising, and if it works for Twitter's tweets, it will work for me... (who can do more can do less ;-) – TheSquad Apr 26 '12 at 10:36
  • You could also split the message log up into pages, where one page is stored as one value. I don't have personal experience with this, but it's described in this talk by Voxer: http://vimeo.com/52827773 – Joe Rideout Apr 02 '14 at 20:55