Creating table with cluster keys that support order by

Question

I have a requirement to create a table with id, creation date, time and other fields. I need to query some 'n' records based on latest creation date and time. In order to achieve this, how the table should be? Creation with id as partition key, and date and time as cluster keys work? Is it possible to do order by for all ids i.e., without any condition on partition key?

This link may provide some insight about the concept https://stackoverflow.com/questions/24949676/difference-between-partition-key-composite-key-and-clustering-key-in-cassandra — Ersoy, May 12 '20 at 10:54

score 0 · Accepted Answer · answered May 13 '20 at 14:44

Creation with id as partition key, and date and time as cluster keys work?

Technically you could, but the performance would be abysmal. Using id as the partition key would put all rows on different nodes, regardless of timestamp. As you want to query by a time range, you should also store your data by a time component.

Is it possible to do order by for all ids i.e., without any condition on partition key?

No, not at all possible. Cassandra is designed to perform sequential reads. Therefore, result set ordering in Cassandra is dependent on data being already stored in the order that you want to retrieve it.

This is how I would do it:

CREATE TABLE events (
  id UUID,
  month_bucket INT,
  creation TIMESTAMP,
  name TEXT,
  PRIMARY KEY ((month_bucket),creation,id))
WITH CLUSTERING ORDER BY (creation DESC, id ASC);

After inserting some data, this query works:

aploetz@cqlsh:stackoverflow> SELECT * FROM events
    WHERE month_bucket=202005 
    AND creation >= '2020-05-01' AND creation < '2020-05-13';

 month_bucket | creation            | id          | name
--------------+---------------------+-------------+----------------
       202005 | 2020-05-05 05:00:00 | 1d073ed5... |  Cinco de Mayo
       202005 | 2020-05-04 05:00:00 | 3cca75a0... | May the Fourth
       202005 | 2020-05-01 05:00:00 | fd2bb58f... |        May Day

(3 rows)

This utilizes a modeling technique known as "bucketing." Since I'm not sure of the number of entries that can happen for a given time period (only you know that), I guessed and used "month" as an example. Basically, you'll want to choose a time "bucket" that is both:

big enough to be useful (provide the queried data)
small enough to stay under Cassandra's 2GB/2 billion cells per partition requirement.

Then, provide the bucketing value with the query, along with a range component on creation.

The CLUSTERING ORDER definition keeps the data stored by creation in a descending order. id is added to the end as a way to ensure uniqueness. I've used this pattern to solve range query models on date/time components several times.

Creating table with cluster keys that support order by

1 Answers1