Cassandra schema - select by frequently updated column

Question

Given table:

CREATE TABLE T (
      a int,
      last_modification_time timestamp,
      b int,
     PRIMARY KEY (a)
);

I'm frequently updating records. With each update last_modification_time is set to now() and also other fields are set.

What is the right cassandra approach to be able to query by last_modification_time range? I need to query like this:

select * from .. where a=Z and last_modification_time < X and last_modification_time > Y;

One way would be to create materialized view with PRIMARY KEY (a, last_modification_time) but I want to avoid this since materialized views are buggy in 3.X cassandra versions.

What would be alternative way of querying by last_modification_time range given last_modification_time is frequently updated?

your schema will result in wide rows... if you cannot change data model and will be always using partition key try secondary index... https://stackoverflow.com/questions/35719985/range-query-on-secondary-index-in-cassandra — undefined_variable, Oct 05 '17 at 06:03
@undefined_variable could you clarify what do you mean by 'wide row'? — Michal, Oct 05 '17 at 13:04
Your best alternative is probably to use Postgres to support a query model like this. — Aaron, Oct 05 '17 at 13:06
In this case last_modification_time is updated around 20 times then record becomes stable. Given this constraint do you see any data model supporting range queries on last_modification_time? — Michal, Oct 05 '17 at 13:15
@undefined_variable secondary indexes wont work - range queries are not allowed on indexed columns — Michal, Oct 05 '17 at 15:01
Another approach would be to use last_modification_time_one_hour_bucket instead of last_modification_time, then delete record if last_modification_time fits into new bucket and insert a record with new bucket — Michal, Oct 05 '17 at 20:26
@MichalWegorek Wide row was my mistake... But according to link provided cassandra 3.x does provide inequality clause on secondary index (Not tested myself) — undefined_variable, Oct 06 '17 at 05:58
@undefined_variable Inequality is not enough to do range queries. I tested 3.9 cassandra, range queries on indexed columns and not supported — Michal, Oct 06 '17 at 14:26

score 0 · Answer 1 · answered Oct 09 '17 at 21:26

How about having two tables? One could hold the current snapshot where you're updating the last_modification_time field and another one which holds the changes over time (something like a history table)? You could write to both of them using BATCH statements.

CREATE TABLE t_modifications (
      a int,
      last_modification_time timestamp,
      b int,
     PRIMARY KEY (a, last_modification_time)
) WITH CLUSTERING ORDER BY (last_modificaton_time DESC);

BEGIN BATCH
      UPDATE T SET last_modification_time = 123, b = 4 WHERE a = 2;
      INSERT INTO t_modifications (a, last_modification_time, b) values (2, 123, 4);
APPLY BATCH;

If you're interested on the latest snapshot by a given modification range, you can select and limit the t_modifications table:

SELECT * FROM t_modifications WHERE a = 2 AND last_modification_time < 136 LIMIT 1;

score 0 · Answer 2 · answered Sep 10 '20 at 01:30

In general, to do range queries like this, the field you want to range on has to be part of the composite key, has to be the right-most element of the composite key, and all other elements in the composite key have to be specified. In your case, you would modify your PRIMARY KEY to (a, last_modification_time). You can then

SELECT * from t_modifications 
 WHERE a = aval
   AND last_modification_time > beg
   AND last_modification_time < end;

This will get you all records for aval between beg and end.

Cassandra schema - select by frequently updated column

2 Answers2