
Over the last few days I have been playing a bit with Riak. The initial setup was easier than I thought. Now I have a 3-node cluster, with all nodes running on the same VM for the sake of testing.

I admit the hardware settings of my virtual machine are very constrained (1 CPU, 512 MB RAM), but I am still quite surprised by the slow performance of Riak.

Map Reduce

Playing a bit with map reduce, I had around 2,000 objects in one bucket, each about 1–2 KB in size as JSON. I used this map function:

function(value, keyData, arg) {
    // mapValuesJson decodes the object's JSON value and returns an array
    var data = Riak.mapValuesJson(value)[0];

    // Guard against objects that have no displayname field
    if (data.displayname && data.displayname.indexOf("max") !== -1) return [data];
    return [];
}

And it took over 2 seconds just to perform the HTTP request and return its result, not counting the time my client code took to deserialize the results from JSON. Removing 2 of the 3 nodes seemed to improve performance slightly, to just below 2 seconds, but this still seems really slow to me.
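For context, a full-bucket job like this is submitted as a single JSON body to Riak's /mapred HTTP endpoint. The sketch below builds that job body in plain JavaScript; the bucket name "users" is made up, and the exact wire format is my assumption based on Riak's documented HTTP MapReduce interface.

```javascript
// Sketch of the JSON job body a client POSTs to Riak's /mapred endpoint.
// Hypothetical bucket name "users"; a bare bucket name as "inputs" makes
// Riak feed every object in that bucket to the map phase.
var job = {
    inputs: "users",
    query: [{
        map: {
            language: "javascript",
            source: 'function(value, keyData, arg) {' +
                    '  var data = Riak.mapValuesJson(value)[0];' +
                    '  if (data.displayname.indexOf("max") !== -1) return [data];' +
                    '  return [];' +
                    '}'
        }
    }]
};

var body = JSON.stringify(job);
// POST this body to http://<node>:8098/mapred
// with Content-Type: application/json
```

Passing the map function as inline source like this means Riak must spin up its JavaScript VM for every object considered, which is part of why full-bucket jobs feel slow.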

Is this to be expected? The objects were not that large in byte size, and 2,000 objects in one bucket isn't that much either.

Insert

Batch inserting around 60,000 objects of the same size as above took rather long and actually didn't really work.

My script, which inserted the objects into Riak, died at around 40,000 and said it couldn't connect to the Riak node anymore. In the Riak logs I found an error message indicating that the node had run out of memory and died.

Question

This is really my first shot at Riak, so there is definitely a chance that I screwed something up.

  • Are there any settings I could tweak?
  • Are the hardware settings too constrained?
  • Maybe the PHP client library I used for interacting with Riak is the limiting factor here?
  • Running all nodes on the same physical machine is rather stupid, but if this is a problem, how can I better test the performance of Riak?
  • Is map reduce really that slow? I read about the performance hit of map reduce on the Riak mailing list, but if map reduce is slow, how are you supposed to perform "queries" for data needed in near real time? I know that Riak is not as fast as Redis.

It would really help me a lot if anyone with more experience with Riak could help me out with some of these questions.

Max
    Why don't you ask on the Riak mailing list? Most of Basho's employees are there to help you with your problems. – Joshua Partogi May 15 '11 at 12:38
  • I know this has been answered but to just point out: "RAM is one of the most important factors – RAM availability directly affects what Riak backend you should use (see question below), and is also required for complex MapReduce queries." from: http://basho.com/top-five-questions-about-riak-2/ – scape Sep 18 '13 at 13:06

3 Answers


This answer is a bit late, but I want to point out that Riak's mapreduce implementation is designed primarily to work with links, not entire buckets.

Riak's internal design is actually pretty much optimized against working with entire buckets. That's because buckets are not considered to be sequential tables but a keyspace distributed across a cluster of nodes. This means that random access is very fast — probably O(log n), but don't quote me on that — whereas serial access is very, very, very slow. Serial access, the way Riak is currently designed, necessarily means asking all nodes for their data.

Incidentally, "buckets" in Riak terminology are, confusingly and disappointingly, not implemented the way you probably think. What Riak calls a bucket is in reality just a namespace. Internally, there is only one bucket, and keys are stored with the bucket name as a prefix. This means that no matter how small or large your bucket is, enumerating the keys in a single bucket of size n takes O(m) time, where m is the total number of keys across all buckets.
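A toy model of that flat keyspace makes the cost concrete. This is plain JavaScript, nothing Riak-specific; it just mimics storing keys with a bucket-name prefix in one shared store, so listing one bucket's keys still has to walk every key.

```javascript
// Toy model of a single flat keyspace: all buckets share one store,
// and each key is stored as "<bucket>/<key>".
var store = {};

function put(bucket, key, value) {
    store[bucket + "/" + key] = value;
}

// Listing one bucket's keys walks every key in the store: O(m) in the
// total number of keys, not O(n) in the size of the requested bucket.
function listKeys(bucket) {
    var prefix = bucket + "/";
    return Object.keys(store).filter(function (k) {
        return k.indexOf(prefix) === 0;
    }).map(function (k) {
        return k.slice(prefix.length);
    });
}

put("users", "max", '{"displayname":"max"}');
put("users", "anna", '{"displayname":"anna"}');
put("logs", "evt1", "...");

// listKeys("users") scans all 3 stored keys to return ["max", "anna"]
```

Even though "logs" holds only one entry, its key is still examined when enumerating "users", which is the m-versus-n distinction described above.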

These limitations are implementation choices by Basho, not necessarily design flaws. Cassandra implements the exact same partitioning model as Riak, but supports efficient sequential range scans and mapreduce across large amounts of keys. Cassandra also implements true buckets.

Alexander Staubo
  • 3,148
  • 2
  • 25
  • 22
  • Could you elaborate on how Cassandra implements true buckets? – Carlo Pires Jan 05 '12 at 22:50
  • Since Riak 1.0 it is possible to use multiple backends, which is useful if you want different storage engines or different configurations. Or in this case, if you need to traverse all entries of a bucket without paying a performance penalty for entries in other buckets. With multiple backends, the _m_ is reduced to the total number of keys in all buckets of the same backend. – Ulrik Jan 18 '12 at 06:37
  • @CarloPires: In Cassandra, individual keyspaces (analogous to Riak's buckets) are stored and indexed separately. – Alexander Staubo Jan 18 '12 at 19:41
  • I wonder why they can't just keep a separate mapping of buckets => keys in order to get the value of *m* down to just about *n*. Consult the key list, then use that fixed list of keys to get the records. You can do this separately via some other data store (such as Redis), but it seems like a common enough scenario to do this inside of Riak itself. – d11wtq Apr 14 '13 at 08:13

A recommendation I'd make, now that some time has passed and several new versions of Riak have come out, is this: never rely on full-bucket map/reduce. It is not an optimized operation, and chances are very good there are other ways to structure your map/reduce so you don't have to look through so much data to pull out the few records you need.

Secondary indexes, available in newer versions of Riak, are definitely the way to go here. Put an index on the objects you want to find (perhaps named 'ismax_int', with a value of 0 or 1). You can map/reduce against a secondary index with hundreds of thousands of keys in a tiny fraction of the time a full bucket scan would take to consider the same data.
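As a sketch of what that looks like over Riak's HTTP interface (the bucket and index names here are made up): an integer secondary index is attached at write time via an `x-riak-index-<name>_int` header, and queried for an exact match through the `/buckets/<bucket>/index/...` endpoint.

```javascript
// Hypothetical bucket and index names; header and URL shapes follow
// Riak's HTTP secondary-index (2i) interface.
function putHeaders(isMax) {
    return {
        "Content-Type": "application/json",
        // Integer secondary index: 1 if the object matches, else 0.
        "x-riak-index-ismax_int": String(isMax ? 1 : 0)
    };
}

// Exact-match 2i query URL: Riak returns only the matching keys,
// without scanning the whole bucket.
function indexQueryUrl(host, bucket, value) {
    return "http://" + host + ":8098/buckets/" + bucket +
           "/index/ismax_int/" + value;
}

var url = indexQueryUrl("127.0.0.1", "users", 1);
// → "http://127.0.0.1:8098/buckets/users/index/ismax_int/1"
```

The key list from the index query can then be fed into a map/reduce job as its inputs, so the map phase only ever sees matching objects. Note that 2i requires the LevelDB backend rather than the default Bitcask.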

MightyE

I don't have direct experience with Riak, but I have worked a little with Cassandra, which is similar.

Firstly, performance will probably depend a lot on the number of cores available, and the memory. These systems are usually heavily pipelined and concurrent and benefit from a lot of cores. 4+ cores and 4GB+ of RAM would be a good starting point.

Secondly, MapReduce is designed for batch processing, not realtime queries.

Riak and all similar key-value stores are designed for high write performance and high read performance for simple lookups, with no complex querying at all.
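To make "simple lookup" concrete: in Riak a point read is a single HTTP GET against a per-key URL, with no query planning involved. The bucket and key names below are invented for illustration.

```javascript
// Riak exposes each object at a predictable per-key URL, so a point
// lookup is one HTTP GET. Hypothetical bucket/key names.
function objectUrl(host, bucket, key) {
    return "http://" + host + ":8098/buckets/" + bucket + "/keys/" + key;
}

var url = objectUrl("127.0.0.1", "users", "max");
// → "http://127.0.0.1:8098/buckets/users/keys/max"
```

This is the access pattern these stores optimize for; anything that must inspect many objects (like a full-bucket map/reduce) falls outside it.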

Just for comparison: Cassandra on a single node (6 cores, 6 GB RAM) can do 20,000 individual inserts per second.

DNA
  • "MapReduce is designed for batch processing, not realtime queries. ... At all." – thanks, that's what I wondered: am I using map reduce wrong, or is it just the wrong tool for what I am trying to do? Thanks a lot for your insights! – Max May 16 '11 at 05:06
    Actually MapReduce was invented for batch-processing, but it's just an abstract model of computation. Riak's implementation has been designed for real-time queries, but only in terms of link following, not across entire buckets. – Alexander Staubo Oct 01 '11 at 20:40