simplest example of a query by date range in cassandra 1.x

Question

I want to store an ID and a date and I want to retrieve all entries from dateA up to dateB, what exactly do I need to be able to perform select from my_column_family where date >= dateA and date < dateB; ?

btw this doesn't have an answer: http://stackoverflow.com/questions/3535017/how-to-query-cassandra-by-date-range (and it's for an old version anyway) — Samus_, Dec 26 '11 at 19:44

score 1 · Answer 1 · answered Dec 27 '11 at 18:46

the guys at #cassandra (IRC) helped me find a way, there's many subtle details so I'd like to document that here.

first you need to declare a column family similar to this (examples from cassandra-cli):

create column family users with comparator=UTF8Type and key_validation_class=UTF8Type and column_metadata=[
    {column_name: id, validation_class: LongType}
    {column_name: name, validation_class: UTF8Type, index_type: KEYS}
    {column_name: age, validation_class: LongType}
];

few important things about this declaration:

the comparator and key_validation_class are there to be able to use strings as key names
the first declared column is special, it's the "row key" which is used to address each row and therefore cannot contain duplicate values (the INSERT is really an UPSERT so when there's duplicates the new values overwrite the old ones)
the second column declares a "secondary index" on its values (more on that below)
the dates are stored as Long datatypes, interpretation is up to the client

now let's add some values:

set users[1][name] = john;
set users[1][age]  = 19;
set users[2][name] = jane;
set users[2][age]  = 21;
set users[3][name] = john;
set users[3][age]  = 32;

according to this: http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/ Cassandra does not support the < operators, what it does is to manually exclude the rows that don't match but it does that AFTER there's a resultset and it also refuses to do so unless and actual filtering has taken place.

what that means is that a query like get users where age > 20; will return null but if we add a predicate that includes = it'll magically work.

here's where the secondary index is important, without it you can't use = so on this example I can do get users where name = jane; but I cannot ask for get users where age = 21;

the funny thing is that, after using = the < works so having a secondary index allows you to ask for get users where name = john and age > 20; and it'll filter correctly.

this is a way to solve the problem however it relies on duplicate values on the secondary-indexed column which may not be the best idea; I'd love to hear alternatives and discuss pros and cons of each. I've been told that this could also be done with TimeUUID and get_slice from the ordered partitioner but I don't know how to do it that way and also I've read that the ordered partitioner is bad at balancing. — Samus_, Dec 27 '11 at 18:46
also this doesn't work with supercolumns since they don't support secondary indexes. — Samus_, Jan 19 '12 at 20:27
Actually, the fact that only = works on Secondary indexes is not magic, it's because of the nature of the indexes which aren't BTrees. — Gepsens, Mar 03 '13 at 12:39

score 1 · Answer 2 · answered Dec 27 '11 at 20:08

There are a few ways to solve this. The simplest is probably the secondary index solution with the equality limitation mentioned in your own answer. I've used this method, adding an additional column called 'valid', setting the value to 1. Then the queries can become where valid=1 and date>nnnn

The other solutions require additional column families and additional queries.

When loading the data, create and add to a column family which contains the timestamps as keys, and each entry would list all the user ids as column names.

If the partitioning strategy is ordered, then a single RangeSliceQuery can specify the date range as a key range and get all the columns for each key. Then iterate through the result keys, using the column values for each user id and if needed, query the original column family for the data associated with each id. Cassandra always stores the column names sorted, and can be reversed when reading.

But, as documented, the ordered partitioner is not ideal, leading to hot spots and difficulty in load balancing the nodes.

Without the ordered partitioner, still keeping the timestamp column family, you would have to create another column family while loading data where you can store all the timestamps as the columns under one or more known keys (e.g. 'created' or 'updated'). The first query would be a SliceQuery for a known key, and then the column names (as timestamps) would provide the keys for the MultigetSliceQuery to the timestamp column family.

I've used variations on this, usually adding Composite keys or columns for additional flexibility.

Do I really need to add a dummy column and set it to 1 in every one of my insertions just to be able to filter on a totally unrelated other column by range? The tools are working against me there... — Kevin Coulombe, Aug 16 '12 at 16:02
With a single secondary index, I believe so.. but .. look at [the answer](http://stackoverflow.com/questions/11915255/why-are-super-columns-in-cassandra-no-longer-favoured) to a totally unrelated question to get some ideas how CQL3 can leverage composites to possibly solve your problem. See [CQL#select](http://cassandra.apache.org/doc/cql3/CQL.html#selectStmt) for more details. — libjack, Aug 16 '12 at 20:32
I read that it is possible to perform range queries efficiently if we use the column as part of a composite primary key, but that only works with an ordered partitioner which is recommended against because it messes up the sharding strategy... Using a dummy column isn't better. It requires a full table scan so it is very slow on large datasets. Are there any other solution to range filtering that works on large distributed datasets in Cassandra? — Kevin Coulombe, Aug 17 '12 at 15:03
Columns are always stored sorted, so if the column names are composites, you can get an effective query via ColumnSliceIterator, etc. In the case where row keys are composites, then yes the partitioner would dictate. From the CQL example, note that primary key is not the same as row key. — libjack, Aug 17 '12 at 15:33

simplest example of a query by date range in cassandra 1.x

2 Answers2