SparkSQL restrict queries by Cassandra partition key ranges

Question

Imagine that my primary key is a timestamp.

I would like to restrict the query by timestamp ranges.

I can't get this to work, even when using token(). I also can't create a secondary index on the partition key.

How should this be done?

Next time please invest a bit in searching SO first, you'll usually find an answer http://stackoverflow.com/questions/13700288/timestamp-date-as-key-for-cassandra-column-family-hector — Moshe Eshel, Mar 14 '16 at 11:30

score 2 · Answer 1 · edited May 23 '17 at 11:50


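Cassandra doesn't allow range queries on the value of a partition key. With the default partitioner, partitions are placed by the hash (token) of the key, so a token() range follows hash order rather than timestamp order.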

One way to deal with this is to change your schema so that the timestamp is a clustering column. For this to work, you need to introduce a sentinel column as the partition key. See this question for more detailed answers: Range Queries in Cassandra (CQL 3.0)
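As a rough sketch of the schema change, assume the sentinel is a per-day bucket. The table `events`, the columns `day`, `ts` and `payload`, and both helper functions are hypothetical names chosen for illustration. Each query pins one partition and restricts the clustering column by range. The `%s` placeholders follow the style some Python drivers use for bound parameters.

```python
from datetime import date, datetime, timedelta

# Hypothetical schema: the day bucket is the partition key,
# the timestamp is a clustering column and so supports range restrictions.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS events (
    day date,
    ts timestamp,
    payload text,
    PRIMARY KEY ((day), ts)
)
"""


def day_buckets(start, end):
    """Return every calendar day touched by the half-open range [start, end)."""
    if end <= start:
        return []
    # Step back one microsecond so an end at midnight does not add an empty day.
    last = (end - timedelta(microseconds=1)).date()
    days = []
    current = start.date()
    while current <= last:
        days.append(current)
        current += timedelta(days=1)
    return days


def range_queries(start, end):
    """Build one (cql, params) pair per bucket for the range [start, end)."""
    cql = "SELECT * FROM events WHERE day = %s AND ts >= %s AND ts < %s"
    return [(cql, (day, start, end)) for day in day_buckets(start, end)]
```

The bucket size is a trade-off: a single constant sentinel puts all rows in one partition, while finer buckets mean more queries per range.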

Another way is to let Spark do the filtering. Range queries on the primary key should work in Spark SQL. They simply would not be pushed down to Cassandra: Spark would fetch all the data and filter it on the Spark side.
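A minimal PySpark sketch of the Spark-side filter, assuming the Spark Cassandra Connector is on the classpath and a `SparkSession` is passed in as `spark`. The keyspace, table and column names are placeholders, and `range_condition` and `load_range` are hypothetical helpers. The range is expressed as an ordinary filter expression; whether any part of it is pushed down depends on the connector.

```python
from datetime import datetime


def range_condition(column, start, end):
    """Build a Spark SQL predicate for the half-open range [start, end)."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return (
        f"{column} >= CAST('{start.strftime(fmt)}' AS TIMESTAMP) "
        f"AND {column} < CAST('{end.strftime(fmt)}' AS TIMESTAMP)"
    )


def load_range(spark, keyspace, table, column, start, end):
    """Load a Cassandra table as a DataFrame and filter it by timestamp range."""
    df = (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(keyspace=keyspace, table=table)
        .load()
    )
    # DataFrame.filter accepts a SQL expression string.
    return df.filter(range_condition(column, start, end))
```

For example, `load_range(spark, "my_ks", "my_table", "ts", datetime(2016, 3, 1), datetime(2016, 3, 14))` would return the rows in that window, at the cost of a full table scan if the predicate is not pushed down.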


answered Mar 14 '16 at 12:00

Piotr Kołaczkowski


That makes sense. Can the SparkSQL filtering be expressed as a `where` clause even though it won't be pushed down? – Cedric H. Mar 14 '16 at 12:36
One more question: doesn't Cassandra allow range queries on the partition key via the `token()`? That's what Spark is doing in the background right? – Cedric H. Mar 14 '16 at 12:39
