
I'm a newcomer to Riak and I've been reading this chapter from Riak's docs. It shows that by adding structural information to buckets and keys, one can overcome some of the limitations of key/value operations.

Though the article gives an example of how such a key would be structured:

sensor data keys could be prefaced by sensor_ or temp_sensor1_ followed by a timestamp (e.g. sensor1_2013-11-05T08:15:30-05:00)

no method is mentioned for querying the data by key prefix (e.g. sensor1_). Looking around Stack Overflow I found this question. In it, MapReduce and key filtering are mentioned as possible solutions, but the documentation on key filters states that they are a soon-to-be-deprecated feature. I also checked out Riak Search as a possible approach, but wasn't able to find a way to query data by key prefix.

My question is: what is the best way to search for data by key prefix? I would greatly appreciate an example.


1 Answer


The best way to search for a key prefix is not to do it if you don't need to, i.e. design around that search pattern if you can. The primary way to do that is to use deterministic keys that your application can easily compute. That said, if you cannot avoid building your application to require searching on key prefixes, there are a couple of things you can do (all of which have their drawbacks).

  1. Key Filters - http://docs.basho.com/riak/latest/dev/references/keyfilters/ - as you already noted, these are marked as deprecated and are not recommended at this point.
  2. MapReduce - http://docs.basho.com/riak/latest/dev/advanced/mapreduce/ - a good option if you can query in batches, but not really suited to real-time querying. You could cache the query results if precomputing the queries is helpful.
  3. Riak Search 2.0 (Solr) - http://docs.basho.com/riak/latest/dev/using/search/ - this is probably the easiest method to implement from an application perspective and allows you to query your keys with something along the lines of: curl "$RIAK_HOST/search/sensor?wt=json&q=_yz_rk:sensor1_*". Using Search does come with a performance hit over straight key-based gets, but you can cache query results.
  4. Data Modeling - querying by key directly is always going to provide the best performance, as mentioned above. One option is to take advantage of Riak's Data Types (CRDTs) and create a bucket that uses sets. You could create a set for each sensor that contains the keys associated with that sensor in the first bucket. Then you can iterate over the keys in the set and do a multi-get to return all of the associated records (see the sketch just below this list).
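
A minimal sketch of the set-based idea from option 4, using the official Python client (the riak package). The bucket type "sets", the bucket names, and the key scheme are assumptions for illustration; adjust them to your setup:

    import riak
    from riak.datatypes import Set

    # Assumes a set bucket type was created and activated beforehand, e.g.:
    #   riak-admin bucket-type create sets '{"props":{"datatype":"set"}}'
    #   riak-admin bucket-type activate sets
    client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)

    data_bucket = client.bucket('sensor_data')                       # raw readings
    index_bucket = client.bucket_type('sets').bucket('sensor_keys')  # one set of keys per sensor

    def store_reading(sensor_id, timestamp, reading):
        """Store a reading and remember its key in the sensor's set."""
        key = '%s_%s' % (sensor_id, timestamp)   # e.g. sensor1_2013-11-05T08:15:30-05:00
        data_bucket.new(key, data=reading).store()
        key_set = Set(index_bucket, sensor_id)
        key_set.add(key)
        key_set.store()

    def readings_for(sensor_id):
        """Fetch every reading recorded for a sensor via its key set."""
        key_set = Set(index_bucket, sensor_id)
        # reload() fetches the current value from Riak; fetching the set with
        # index_bucket.get(sensor_id) should work as well.
        key_set.reload()
        # multiget issues the gets in parallel; a plain loop over
        # data_bucket.get(key) works too.
        return [obj.data for obj in data_bucket.multiget(list(key_set.value))]

No key-prefix scan is needed at read time: the set acts as the index of keys for each sensor.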

Hope this gives you some ideas.

Craig
  • To add another option to @Craig's data modeling ideas: aggregate keys over time periods. Instead of writing each sensor value to its very own key, group them into a single key/value pair containing all sensor values for a one-minute period (or whatever size works for you). Then you have well-defined, predictable keys that can be iterated over without needing to scan for them. – Joe Dec 17 '14 at 19:30
  • @Craig Thanks for the comprehensive answer. Still wondering, though: if this search pattern is recommended by Riak's designers, why isn't there any straightforward way of implementing it, other than key filters, which are being deprecated? – Nikolay Manolov Dec 18 '14 at 08:11
  • @NikolayManolov Great question. It is important to keep in mind that the mechanism Riak uses to distribute data around the cluster evenly makes it difficult to iterate over keys natively. This is why the List Keys op isn't recommended in production (it is very expensive). If you can't design around the problem I'd recommend Solr and then MapReduce. – Craig Dec 18 '14 at 13:44
  • The recommendation is to create meaningful key names that you can predict and therefore get iteratively, not filterable key names that force you to conduct a search operation in order to determine which keys you want. For example, if you know you should have sensor data for every minute and you want all sensor readings for a particular day, you could loop through and get the 1440 keys without needing to search for anything. If you wanted to run a batch process on them, pass the keys as input to a MapReduce job. – Joe Dec 18 '14 at 17:34
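
To make the approach in these comments concrete, here is a minimal sketch assuming one aggregated value per sensor per minute and a hypothetical key scheme of sensor1_YYYY-MM-DDTHH:MM (the client settings and bucket name are likewise illustrative):

    from datetime import datetime, timedelta

    import riak

    client = riak.RiakClient(protocol='pbc', host='127.0.0.1', pb_port=8087)
    bucket = client.bucket('sensor_data')

    def minute_key(sensor_id, ts):
        """Deterministic key for the one-minute aggregate containing ts."""
        return '%s_%s' % (sensor_id, ts.strftime('%Y-%m-%dT%H:%M'))

    def readings_for_day(sensor_id, day):
        """Fetch all 1440 per-minute aggregates for a sensor on a given day.

        No prefix search is needed: the keys are computed up front and
        fetched directly (multiget issues the gets in parallel).
        """
        start = datetime(day.year, day.month, day.day)
        keys = [minute_key(sensor_id, start + timedelta(minutes=m))
                for m in range(1440)]
        return [o.data for o in bucket.multiget(keys)
                if getattr(o, 'exists', False)]

The same computed key list could also be passed as inputs to a MapReduce job if you need batch processing rather than real-time reads.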