How to Model Table to run query's based on status field which is changing over time

Question

Currently we have a table which we query using shipment_id and in future we have a need to query based on status field Current Table :

CREATE TABLE shipment ( 
    shipment_id text,
    tenant_id text,
    actual_arrival_time text,
    actual_dep_time text,
    email_ids set,
    is_deleted boolean,
    modified_by text,
    modified_time timestamp,
    planned_arrival_time text,
    planned_dep_time text,
    route_id text,
    shipment_departure_date text,
    status_code text,
    PRIMARY KEY (shipment_id, tenant_id) 
); 

CREATE INDEX shipment_id_index ON shipment (tenant_id);

Current Query's

1) SELECT * FROM shipment where tenant_id=?0 ALLOW FILTERING ;

2) SELECT * FROM shipment WHERE shipment_id=?0 and tenant_id=?1 ;

Pending/Future Query's

list of shipment id's for given status code as of now 3) SELECT * FROM shipment WHERE tenant_id = 'y' and status_code = x ? ;

4) list of shipment id's for given status code for last 1 week

5) list of shipment id's for which got delayed

Above table There could be 10-15 unique tenants and will have 1 shipment_id,1 tenant_id 1 row per table and status_code will be changing over the time as shipment progresses , from Shipment_started, shipment_progress, shipment_delayed, shipment_delayed_completed and shipment_completed etc each shipment in its life time will go through 3-5 statuses , the current table will be updated only when there is a status change for a given shipment_id .

I need to create a new table which can address query's something like below

3) list of shipment's for a given tenant which have status_code = 'x' as of now

4) list of shipment's for a given tenant which have status_code = 'x' for last 1 week

5) list of shipment's for which got delayed ?

Mohamed Ibrahim Elsayed · Accepted Answer · 2018-12-22T15:16:55.970

In Cassandra, you model your tables based on your queries, so you may actually create a table for every query you may execute. Also using ALLOW FILTERING in your queries is something that should be used for development and testing purposes only and not in your actual prodcution application (check the answer here: Cassandra CQLEngine Allow Filtering).

So for each of the cases/queries you mentioned I suggest the following:

1) SELECT * FROM shipment where tenant_id=?0 ALLOW FILTERING;

this should be addressed by the following table:

CREATE TABLE shipment ( 
    tenant_id text,
    shipment_id text,
    actual_arrival_time text,
    actual_dep_time text,
    email_ids set,
    is_deleted boolean,
    modified_by text,
    modified_time timestamp,
    planned_arrival_time text,
    planned_dep_time text,
    route_id text,
    shipment_departure_date text,
    status_code text,
    PRIMARY KEY (tenant_id, shipment_id) 
);

here the tenant_id is the partition key so if you execute your query: SELECT * FROM shipment where tenant_id='x'; then you don't need to use ALLOW FILTERING any more.

Update: I've also added the shipment_id as part of the primary key to handle same cardinality in case the tenant_id is not unique so that the primary key be consisted of both tenant_id and shipment_id to avoid overwriting records with same tenant_id As per @Himanshu Ahire's comment.

2)SELECT * FROM shipment WHERE shipment_id='x' and tenant_id='y';

this should be addressed by the following table:

CREATE TABLE shipment ( 
    shipment_id text,
    tenant_id text,
    actual_arrival_time text,
    actual_dep_time text,
    email_ids set,
    is_deleted boolean,
    modified_by text,
    modified_time timestamp,
    planned_arrival_time text,
    planned_dep_time text,
    route_id text,
    shipment_departure_date text,
    status_code text,
    PRIMARY KEY ((shipment_id, tenant_id)) 
);

here the shipment_id and tenant_id both are used as composite partition key

3) SELECT * FROM shipment WHERE tenant_id = 'y' and status_code = 'x';

4) list of shipment id's for given status code for last 1 week

5) list of shipment id's for which got delayed

these should be addressed by the following table:

CREATE TABLE shipment (
    status_code text,
    tenant_id text,
    shipment_id text,
    actual_arrival_time text,
    actual_dep_time text,
    email_ids set,
    is_deleted boolean,
    modified_by text,
    modified_time timestamp,
    planned_arrival_time text,
    planned_dep_time text,
    route_id text,
    shipment_departure_date text,
    PRIMARY KEY ((tenant_id, status_code), actual_arrival_time) 
) WITH CLUSTERING ORDER BY (actual_arrival_time DESC);

here you should also use both tenant_id and status_code as composite partition key and the actual_arrival_time as clustering column so you can easily create queries like:

3) SELECT * FROM shipment WHERE tenant_id = 'y' and status_code = 'x';

4) SELECT * FROM shipment WHERE tenant_id = 'y' and status_code = 'x' and actual_arrival_time >= 'date of last week';

5) SELECT * FROM shipment WHERE tenant_id = 'y' and status_code = 'x' and actual_arrival_time > planned_arrival_time;

just a note for query number 4 you can send the date of last week from your application code or using cql functions

Follow up question: for table primarykey (tenant_id) how to maintain same cardinality ? Because shipment_id + tenant_id will create unique row. Which is not the case for single tenant_id key this may lead to overwrite record . Thanks — Himanshu Ahire, Dec 22 '18 at 07:57
In case the `tenant_id` is not unique as you say, we can make a slight modification to the table's `primary key` by making the `shipment_id` as `clustering key`, this will make the primary key be consisted of both `tenant_id` and `shipment_id` so we guarantee that it is unique while keeping the `tenant_id` as `partition key` as intended. — Mohamed Ibrahim Elsayed, Dec 22 '18 at 15:11
As OP said 'status_code will be changing over the time as shipment progresses'. You didn't mention about how update will work. PRIMARY KEY ((tenant_id, status_code), actual_arrival_time) works if status_code does not update. PRIMARY KEY ((shipment_id, tenant_id)) why both key have to used as partition key?What If OP needs query by only shipment_id? — Chaity, Dec 23 '18 at 06:32
Concerning your first point about status_code update, well that's a valid case. But as I mentioned in the answer, the table structure supports queries like `select * from shipment where tenant_id='x' and status_code='y'`, so if the at any point we want to do an update query for the status_code, why not just `insert` a new row instead with the same `tenant_id` and a new `status_code` instead of doing an `update` and we can always execute the same query. — Mohamed Ibrahim Elsayed, Dec 23 '18 at 22:53
Concerning your second point, the OP didn't mention that he wants to query using `shipment_id` only, but if so then a new table with `shipment_id` as partition key should be created. — Mohamed Ibrahim Elsayed, Dec 23 '18 at 22:54
@Mis94 thanks so its safe to assume primary key cant be updated (not inserted) ? — Himanshu Ahire, Dec 25 '18 at 05:23
@HimanshuAhire I can't get your question...if you mean "Can we update columns of primary key?" Then the answer is No. As mentioned here: https://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlUpdate.html UPDATE cannot update the values of a row's primary key fields. If you are in a case that you must update the columns of primary key then you can follow the technique mentioned in the answer here: https://stackoverflow.com/questions/27022325/cassandra-update-primary-key-value — Mohamed Ibrahim Elsayed, Dec 25 '18 at 11:28
Thanks @Mis94 +1 . I was thinking Simple way would be to just insert new row and delete existing row to change primary key . — Himanshu Ahire, Dec 28 '18 at 05:41

How to Model Table to run query's based on status field which is changing over time

1 Answers1