
We are looking at using Cassandra to store a stream of information coming from various sources.

One issue we are facing is the best way to query between two dates.

For example, we will need to retrieve objects created between datetime dt1 and datetime dt2.

We are currently considering using the creation Unix timestamp as the key pointing to the actual object, then calling get_key_range to retrieve the keys in that range.

Obviously this wouldn't work if two items have the same timestamp.

Is this the best way to handle datetimes in NoSQL stores in general?
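
For what it's worth, one common workaround for identical timestamps is a time-based (version 1) UUID, which is what Cassandra's TimeUUID column type stores: keys sort by creation time but stay unique even when two events share a timestamp. A minimal sketch in Python (not from the original post):

```python
import uuid

# Version-1 UUIDs embed a 100-nanosecond timestamp plus node and clock
# sequence, so two events created at the "same" time still get distinct,
# time-ordered keys. Cassandra's TimeUUID type stores exactly these.
key_a = uuid.uuid1()
key_b = uuid.uuid1()

assert key_a != key_b            # unique even for identical wall-clock times
print(key_a.time <= key_b.time)  # the embedded timestamps remain ordered
```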

Rockett
  • Curious as to how you ultimately modeled your data? We're looking at something similar and I am trying to gather best practices, etc. – AlexGad Sep 06 '12 at 17:48

1 Answer


Cassandra rows can be very large, so consider modeling this as columns within a row rather than as rows in a column family (CF); then you can use column slice operations, which are faster than row slices. If there are no "natural" keys associated with the data, you can use daily or hourly row keys like "2010/02/08 13:00".
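
To make the bucketing concrete, here is a sketch of that model expressed in modern CQL via the DataStax Python driver; CQL postdates this 0.5-era answer, and the keyspace and table names are made up for illustration:

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")  # hypothetical keyspace

# One partition (a "row" in 0.5-era terms) per hourly bucket; events inside
# the bucket become clustering columns ordered by timestamp, so a time-range
# read is a single cheap column slice within one partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_hour (
        bucket   text,        -- hourly key, e.g. '2010/02/08 13:00'
        ts       timestamp,   -- event creation time
        event_id timeuuid,    -- breaks ties when ts values collide
        payload  text,
        PRIMARY KEY (bucket, ts, event_id)
    )
""")
```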

Otherwise, yes, using range queries (get_key_range is deprecated in 0.5; use get_range_slice) is your best option.
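
With the bucketed schema sketched above, the dt1/dt2 query from the question becomes a slice within the relevant bucket rather than a key-range scan. Continuing the hypothetical session and table from the previous sketch:

```python
from datetime import datetime

dt1 = datetime(2010, 2, 8, 13, 10)
dt2 = datetime(2010, 2, 8, 13, 40)

# A window spanning multiple hours would loop over each hourly bucket key;
# here the whole window falls inside the 13:00 bucket.
rows = session.execute(
    "SELECT ts, event_id, payload FROM events_by_hour "
    "WHERE bucket = %s AND ts >= %s AND ts < %s",
    ("2010/02/08 13:00", dt1, dt2),
)
for row in rows:
    print(row.ts, row.payload)
```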

jbellis
  • How large is very large? On slide 41 of the presentation at http://www.slideshare.net/jbellis/cassandra-open-source-bigtable-dynamo you say "Millions of columns per row" for 0.5. Is columns in a row still the way to go for really big time series? – Adam Hollidge Mar 12 '10 at 14:15
  • Yes, columns are the way to go. – z8000 Mar 25 '10 at 19:55
  • Is the reason to use columns instead of rows the partitioner? The RandomPartitioner doesn't preserve order, while the ByteOrderedPartitioner creates hotspots. But isn't partitioning based on row keys? That would mean that if we store a large number of columns in a single row, it also suffers from the hot-spot problem? – Gary Shi Nov 02 '11 at 11:00
  • @Gary Shi: You are correct. To spread this evenly across a cluster, people take this idea a bit further and break the data into epochs, assigning each epoch its own row, as described here: http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/. – user359996 Mar 08 '12 at 05:56
  • @user359996: I get your idea, but it still adds complexity to apps, e.g. for scanning requirements. I still prefer the way Google BigTable handles this. – Gary Shi Mar 13 '12 at 09:42