
I am working on Spark 1.2.1 with the datastax/spark-cassandra-connector and a C* table filled with 1B+ rows (DataStax Enterprise, DSE 4.7.0). I need to perform a range filter/where query on a timestamp parameter.

What is the best way to do this without loading the whole 1B+ row table into Spark's memory (which could take hours) and instead push the query down to C*?

Using an RDD with JoinWithCassandraTable, or using a DataFrame with predicate pushdown? Is there something else?
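
For reference, a minimal sketch of the DataFrame route, assuming Spark/connector 1.4+ and hypothetical names (`my_keyspace`, `my_table`, timestamp column `ts`). Note that the connector can only push a filter down when it translates to valid CQL, so a range on a partition key column would not be pushed down:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical names throughout: my_keyspace.my_table with a timestamp column `ts`.
object DataFramePushdownSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf()
      .setAppName("df-pushdown-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1"))
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
      .load()

    // The connector pushes a filter down to C* only when it is valid CQL:
    // equality/IN on partition key columns, ranges on clustering columns.
    // A range on a partition *key* column is evaluated on the Spark side
    // after a scan, which is exactly the problem described above.
    val filtered = df.filter(
      df("ts") >= "2015-10-26 00:00:00" && df("ts") < "2015-10-27 00:00:00")
    println(filtered.count())
  }
}
```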

  • [Here](https://academy.datastax.com/fr/demos/datastax-enterprise-joining-tables-apache-spark) is a demo from DSE on how they perform a join between C* and Spark. Of course they don't use DataFrames, because the feature was not available then. – eliasah Oct 26 '15 at 14:59
  • @eliasah In the link they perform a join between two tables on an id parameter... But I need to perform a join between an array of timestamps (as longs) and a huge table, i.e. a range filter/where query. –  Oct 26 '15 at 15:08
  • DataFrames are available as of 1.4, which ships with the latest DSE. JoinWithCassandraTable is a good option in many cases. – phact Oct 26 '15 at 15:48
  • @phact I tried JoinWithCassandraTable (with 1.2.1) and it didn't work, as I described here: http://stackoverflow.com/questions/33329494/spark-joinwithcassandratable-on-timestamp-partition-key-stuck. Maybe I need to upgrade to 1.4... –  Oct 26 '15 at 15:50
  • You might also check [this](https://github.com/Stratio/cassandra-lucene-index) out. – rs_atl Oct 26 '15 at 18:46

1 Answer


JoinWithCassandraTable turned out to be the best solution in my case. I learned a lot from this post: http://www.datastax.com/dev/blog/zen-art-spark-maintenance and posted an answer to the linked question: [Spark JoinWithCassandraTable on TimeStamp partition key STUCK](http://stackoverflow.com/questions/33329494/spark-joinwithcassandratable-on-timestamp-partition-key-stuck).
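
For illustration, a minimal sketch of that approach, with hypothetical names (`my_keyspace`, `my_table`) and a timestamp partition key `ts`: instead of scanning the table, enumerate the timestamps you need and let JoinWithCassandraTable fetch only the matching partitions.

```scala
import java.util.Date
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical table my_keyspace.my_table whose partition key is a
// timestamp column `ts` (one-minute buckets).
object JoinWithCassandraTableSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf()
      .setAppName("join-with-cassandra-table-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1"))

    // Enumerate the partition keys in the wanted range instead of
    // scanning the whole table (bounds are illustrative).
    val start = 1445817600000L // 2015-10-26 00:00:00 UTC
    val end   = 1445904000000L // 2015-10-27 00:00:00 UTC
    val keys = sc.parallelize(start until end by 60000L)
                 .map(ts => Tuple1(new Date(ts)))

    // Each key becomes a direct partition lookup in C*; only the
    // matching rows ever reach Spark.
    val rows = keys.joinWithCassandraTable("my_keyspace", "my_table")
    rows.take(10).foreach(println)
  }
}
```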

It all comes down to building your C* table the right way for your future queries; choosing good partition keys is especially important.
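
As a sketch of what "the right way" can look like (hypothetical schema), bucketing timestamps into the partition key keeps each partition bounded and turns a time-range query into a set of cheap partition lookups:

```scala
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.SparkConf

// Hypothetical schema sketch: the timestamp bucket is the partition key,
// so each key in the join above maps to exactly one partition.
object SchemaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")
    CassandraConnector(conf).withSessionDo { session =>
      session.execute(
        """CREATE TABLE IF NOT EXISTS my_keyspace.my_table (
          |  ts timestamp,   -- partition key: one-minute time bucket
          |  id uuid,        -- clustering column within a bucket
          |  value double,
          |  PRIMARY KEY (ts, id)
          |)""".stripMargin)
    }
  }
}
```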
