
I have to work with a column family that has (user_id, timestamp) as key. In my query I would like to fetch all records in a given time range independent of the user_id. This is the exact table schema:

CREATE TABLE userlog (
  user_id text,
  ts timestamp,
  action text,
  app_type text,
  channel_name text,
  channel_session_id text,
  pid text,
  region_id text,
  PRIMARY KEY (user_id, ts)
)

I tried to run

SELECT * FROM userlog  WHERE ts >= '2013-01-01 00:00:00+0200' AND  ts <= '2013-08-13 23:59:00+0200' ALLOW FILTERING;

which works fine on my local cassandra installation containing a small data set but fails with

Request did not complete within rpc_timeout.

on the production system containing all the data.

Is there a query, preferably in CQL, that runs smoothly with the given column family, or do we have to change the design?

Faber

3 Answers


The request fails because Cassandra takes longer than rpc_timeout (10 seconds by default) to return the data. For your query, Cassandra will attempt to fetch the entire dataset before returning. For more than a few records this can easily take longer than the timeout.

For queries that produce a lot of data you need to page, e.g.

SELECT * FROM userlog WHERE ts >= '2013-01-01 00:00:00+0200' AND  ts <= '2013-08-13 23:59:00+0200' AND token(user_id) > previous_token LIMIT 100 ALLOW FILTERING;

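where previous_token is the token of the last user_id returned by the previous page (in CQL you can write it as token('<last user_id>')). You will also need to page on ts to guarantee you get all the records for the last user_id returned, since the LIMIT can cut off in the middle of that user's rows.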
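
As a rough sketch of that manual paging (the user_id 'alice' and the timestamp of the last row are made-up values for illustration, not taken from the question), the sequence of queries could look like this:

```cql
-- Page 1: no token restriction yet.
SELECT * FROM userlog
 WHERE ts >= '2013-01-01 00:00:00+0200'
   AND ts <= '2013-08-13 23:59:00+0200'
 LIMIT 100 ALLOW FILTERING;

-- Assume the last row of that page was
-- (user_id = 'alice', ts = '2013-03-05 10:15:00+0200').
-- First fetch the rest of that user's rows in the range,
-- using the exact ts of the last row as the lower bound.
SELECT * FROM userlog
 WHERE user_id = 'alice'
   AND ts >  '2013-03-05 10:15:00+0200'
   AND ts <= '2013-08-13 23:59:00+0200';

-- Then continue with the partitions that follow 'alice' in token order.
SELECT * FROM userlog
 WHERE ts >= '2013-01-01 00:00:00+0200'
   AND ts <= '2013-08-13 23:59:00+0200'
   AND token(user_id) > token('alice')
 LIMIT 100 ALLOW FILTERING;
```

Repeat the last two steps until a page comes back with fewer rows than the LIMIT.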

Alternatively, in Cassandra 2.0.0 (just released), paging is done transparently so your original query should work with no timeout or manual paging.

The ALLOW FILTERING means Cassandra is reading through all your data, but only returning data within the range specified. This is only efficient if the range is most of the data. If you wanted to find records within e.g. a 5 minute time window, this would be very inefficient.

Richard
    what would be efficient for a '5 minute time window' ? – nils petersohn Oct 24 '14 at 21:49
    @nilspetersohn You have to use `ALLOW FILTERING` here because the partitioning key has not been limited. If you are doing a query for an individual `user_id` then you don't need `ALLOW FILTERING` and the query will be more efficient. You would have to know all the `user_id`s in the table beforehand, though. -- Also please note that when Richard said efficient for a large time window he did not mean fast. Filtering will be slow if you have a lot of data in the table no matter what. – Captain Man Jun 06 '18 at 18:29

It appears the hotness for being able to query by time (or any range) is to specify some "other column" as your partition key, and then specify the timestamp as a "clustering column":

CREATE TABLE postsbyuser (
     userid bigint,
     posttime timestamp,
     postid uuid,
     postcontent text,
     PRIMARY KEY ((userid), posttime)
   ) WITH CLUSTERING ORDER BY (posttime DESC);

Insert some fake data:

  insert into postsbyuser (userid, posttime) values (77, '2013-04-03 07:04:00');

and query (the important part being that it is a "fast" query and ALLOW FILTERING is not required, which is how it should be):

  SELECT * FROM postsbyuser where userid=77 and posttime > '2013-04-03 07:03:00' and posttime < '2013-04-03 08:04:00';

You can also use tricks to group by day (and thus be able to query by day) or what not.
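
A minimal sketch of the "group by day" idea, assuming a hypothetical table userlog_by_day whose partition key is a text day bucket in 'YYYY-MM-DD' form (UTC) that the application fills in on every write. Neither the table name nor the bucket format comes from the question:

```cql
-- Hypothetical table: partitioned by calendar day, clustered by time.
CREATE TABLE userlog_by_day (
  day text,        -- assumed bucket, e.g. '2013-04-03' (UTC)
  ts timestamp,
  user_id text,
  action text,
  PRIMARY KEY ((day), ts, user_id)
) WITH CLUSTERING ORDER BY (ts DESC, user_id ASC);

-- The application computes the bucket from ts when writing.
INSERT INTO userlog_by_day (day, ts, user_id, action)
VALUES ('2013-04-03', '2013-04-03 07:04:00+0000', 'sampleuser', 'login');

-- A 5 minute window across all users; no ALLOW FILTERING needed,
-- because the partition key is fixed and ts is the first clustering column.
SELECT * FROM userlog_by_day
 WHERE day = '2013-04-03'
   AND ts >= '2013-04-03 07:00:00+0000'
   AND ts <  '2013-04-03 07:05:00+0000';
```

A range spanning several days then becomes one such query per day. Note that all writes for a given day land in a single partition, so on a busy system you may need a finer bucket (for example by hour) or an extra shard component in the partition key.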

If you use the "group by day" style trick then a secondary index would also be an option (though secondary indexes seem to only work with "EQ" = operator?).

rogerdpack

In general, this may be an indication that you've not modelled your schema to suit your data query, which is the Cassandra way of doing things (https://docs.datastax.com/en/cql/3.3/cql/ddl/dataModelingApproach.html)...

So, ideally, you'd model your schema to suit the query. There are some resources around on how to do time series modelling for Cassandra, although e.g. this slideshare seems to be similar to what you've got - but it's not advertising support for the kind of query you want to do. I don't think I've actually found examples of Cassandra schemas that support "get me all data for a certain time range" queries.

In any case, for the rest of this answer I'll assume you're stuck with the schema you've got for this iteration.

You can do this as two queries:

SELECT DISTINCT user_id FROM userlog;

And then, for each user,

SELECT * FROM userlog WHERE
  user_id='<user>'
  AND ts >= '2013-01-01 00:00:00+0200'
  AND ts <= '2013-08-13 23:59:00+0200';

If the set of user IDs is small to medium sized, you might be able to get away with using an IN query:

SELECT * FROM userlog WHERE
  user_id IN ('sampleuser', 'sampleadmin', ...)
  AND ts >= '2013-01-01 00:00:00+0200'
  AND ts <= '2013-08-13 23:59:00+0200';

Note that this works without ALLOW FILTERING.

m01