1

I have been trying to model a data in Cassandra, and was trying to filter the data based on date in that, as given by the answer here on SO, Here second answer is not using allow filter.

This is my current schema,

CREATE TABLE Banking.BankData(acctID TEXT, 
    email TEXT, 
    transactionDate Date , 
    transactionAmount double ,
    balance DOUBLE, 
    currentTime timestamp , 
    PRIMARY KEY((acctID, transactionDate), currentTime ) 
WITH CLUSTERING ORDER BY (currentTime DESC);

Now have inserted a data by

INSERT INTO banking.BankData(acctID, email, transactionDate, transactionAmount, balance, currentTime) values ('11', 'alpitanand20@gmail.com','2013-04-03',10010, 10010, toTimestamp(now()));

Now when I try to query, like

SELECT * FROM banking.BankData WHERE acctID = '11' AND transactionDate >  '2012-04-03';

It's saying me to allow filtering, however in the link mentioned above, it was not the case. The final requirement was to get data by year, month, week and so on, thats why had taken to partition it by day, but date range query is not working. Any suggestion in remodel or i am doing something wrong ? Thanks

Aaron
  • 55,518
  • 11
  • 116
  • 132
Alpit Anand
  • 1,213
  • 3
  • 21
  • 37

1 Answers1

2

Cassandra supports only equality predicate on the partition key columns, so you can use only = operation on it.

Range predicates (>, <, >=, <=) are supported only only on the clustering columns, and it should be a last clustering column of condition.

For example, if you have following primary key: (pk, c1, c2, c3), you can have range predicate as following:

  • where pk = xxxx and c1 > yyyy
  • where pk = xxxx and c1 = yyyy and c2 > zzzz
  • where pk = xxxx and c1 = yyyy and c2 = zzzz and c3 > wwww

but you can't have:

  • where pk = xxxx and c2 > zzzz
  • where pk = xxxx and c3 > zzzz

because you need to restrict previous clustering columns before using range operation.

If you want to perform a range query on this data, you need to declare corresponding column as clustering column, like this:

PRIMARY KEY(acctID, transactionDate, currentTime ) 

in this case you can perform your query. But because you have time component, you can simply do:

PRIMARY KEY(acctID, currentTime ) 

and do the query like this:

SELECT * FROM banking.BankData WHERE acctID = '11' 
   AND currentTime >  '2012-04-03T00:00:00Z';

But you need to take 2 things into consideration:

  1. your primary should be unique - maybe you'll need to add another clustering column, like, transaction ID (for example, as uuid type) - in this case even 2 transactions happen into the same millisecond, they won't overwrite each other;
  2. if you have a lot of transactions per account, then you may need to add an another column into partition key. For example, year, or year/month, so you don't have big partitions.

P.S. In linked answer use of non-equality operation is possible because ts is clustering column.

Alex Ott
  • 80,552
  • 8
  • 87
  • 132
  • Thanks @Alex, but how then how can we have ranged query with date, any hint toward that, like i want the data between some duration of the same accountID? How to handle that ? – Alpit Anand Apr 07 '20 at 08:29
  • Thanx this was indeed helpful – Alpit Anand Apr 07 '20 at 08:44
  • Thats what i was trying to do as per your second point, but since range query was not possible with partition key, thats why such question – Alpit Anand Apr 07 '20 at 08:50
  • so if a user want to query a data like, to know a balace, he have to know the last transaction month, otherwise wont be possible right ? – Alpit Anand Apr 07 '20 at 08:57
  • yes, you'll need to know the month to do a query if you address the 2nd item... But if you're writing this as a code for API - it shouldn't be a problem - just add corresponding column – Alex Ott Apr 07 '20 at 09:14
  • I am really sorry, i didnt get it, like i as a API creator, also have to fetch the last month his transaction happened to give him the balance information, i.e either one of us have to have track of that, and what do you mean by "it shouldn't be a problem - just add the corresponding column", can you please elaborate it more ? Just bear with me for this – Alpit Anand Apr 07 '20 at 09:22
  • I meant that if you add month component into partition key, then you know the value that you want to handle - it's either current month, or previous month, or month for which data selection is done. In this case - you just generate the corresponding value & send query. If you have overlapping months, like, 30 days - just send 2 queries - for previous month + current month & filter on both of them, and merge results for showing the balance. – Alex Ott Apr 07 '20 at 10:36
  • Thanks Alex, it helped, but the approach i took as denormalization is prefferd, i made a different table with `CREATE TABLE banking.BankInformation(acctID TEXT, balance DOUBLE, lastTransaction Date, PRIMARY KEY(acctID));`, will give balance from here and if logs is asked, will return the amount of logs asked from the table i posted, have opted for day partition, assuming the account is a Company account with lots of transaction, hope it scales well. and thanks again – Alpit Anand Apr 07 '20 at 10:52
  • However will have to run compaction much often on this table, but lets see – Alpit Anand Apr 07 '20 at 10:54