
I am working on Spark 1.2.1 with the datastax/spark-cassandra-connector and a C* table filled with 1B+ rows (DataStax Enterprise, DSE 4.7.0). I need to perform a range filter/where query on a timestamp parameter.

What is the best way to do this without loading the whole 1B+ row table into Spark's memory (which could take hours) and instead push the query down to C*?

Using an RDD with JoinWithCassandraTable, or using a DataFrame with predicate pushdown? Is there something else?
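
For reference, a minimal sketch of the DataFrame route, assuming Spark/connector 1.4+ and hypothetical names (`my_keyspace`, `my_table`, timestamp column `ts`). Note that the connector can only push a filter down when it translates to valid CQL, so a range on a partition key column would not be pushed down:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical names throughout: my_keyspace.my_table with a timestamp column `ts`.
object DataFramePushdownSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf()
      .setAppName("df-pushdown-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1"))
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
      .load()

    // The connector pushes a filter down to C* only when it is valid CQL:
    // equality/IN on partition key columns, ranges on clustering columns.
    // A range on a partition *key* column is evaluated on the Spark side
    // after a scan, which is exactly the problem described above.
    val filtered = df.filter(
      df("ts") >= "2015-10-26 00:00:00" && df("ts") < "2015-10-27 00:00:00")
    println(filtered.count())
  }
}
```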

  • [Here](https://academy.datastax.com/fr/demos/datastax-enterprise-joining-tables-apache-spark) is a demo from DSE on how they perform a join between C* and Spark. Of course they don't use DataFrames, because the feature was not available then. – eliasah Oct 26 '15 at 14:59
  • @eliasah In the link they perform a join between two tables on an id parameter... But I need to perform a join between an array of timestamps (as longs) and a huge table, i.e. a range filter/where query. –  Oct 26 '15 at 15:08
  • DataFrames are available as of 1.4, which ships with the latest DSE. JoinWithCassandraTable is a good option in many cases. – phact Oct 26 '15 at 15:48
  • @phact I tried JoinWithCassandraTable (with 1.2.1) and it didn't work, as I described here: http://stackoverflow.com/questions/33329494/spark-joinwithcassandratable-on-timestamp-partition-key-stuck. Maybe I need to upgrade to 1.4... –  Oct 26 '15 at 15:50
  • You might also check [this](https://github.com/Stratio/cassandra-lucene-index) out. – rs_atl Oct 26 '15 at 18:46

1 Answer


JoinWithCassandraTable turned out to be the best solution in my case. I learned a lot from this post: http://www.datastax.com/dev/blog/zen-art-spark-maintenance and posted an answer to the linked question: [Spark JoinWithCassandraTable on TimeStamp partition key STUCK](http://stackoverflow.com/questions/33329494/spark-joinwithcassandratable-on-timestamp-partition-key-stuck).
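
For illustration, a minimal sketch of that approach, with hypothetical names (`my_keyspace`, `my_table`) and a timestamp partition key `ts`: instead of scanning the table, enumerate the timestamps you need and let JoinWithCassandraTable fetch only the matching partitions.

```scala
import java.util.Date
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical table my_keyspace.my_table whose partition key is a
// timestamp column `ts` (one-minute buckets).
object JoinWithCassandraTableSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf()
      .setAppName("join-with-cassandra-table-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1"))

    // Enumerate the partition keys in the wanted range instead of
    // scanning the whole table (bounds are illustrative).
    val start = 1445817600000L // 2015-10-26 00:00:00 UTC
    val end   = 1445904000000L // 2015-10-27 00:00:00 UTC
    val keys = sc.parallelize(start until end by 60000L)
                 .map(ts => Tuple1(new Date(ts)))

    // Each key becomes a direct partition lookup in C*; only the
    // matching rows ever reach Spark.
    val rows = keys.joinWithCassandraTable("my_keyspace", "my_table")
    rows.take(10).foreach(println)
  }
}
```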

It all comes down to building your C* table the right way for your future queries; choosing good partition keys is especially important.
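
As a sketch of what "the right way" can look like (hypothetical schema), bucketing timestamps into the partition key keeps each partition bounded and turns a time-range query into a set of cheap partition lookups:

```scala
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.SparkConf

// Hypothetical schema sketch: the timestamp bucket is the partition key,
// so each key in the join above maps to exactly one partition.
object SchemaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().set("spark.cassandra.connection.host", "127.0.0.1")
    CassandraConnector(conf).withSessionDo { session =>
      session.execute(
        """CREATE TABLE IF NOT EXISTS my_keyspace.my_table (
          |  ts timestamp,   -- partition key: one-minute time bucket
          |  id uuid,        -- clustering column within a bucket
          |  value double,
          |  PRIMARY KEY (ts, id)
          |)""".stripMargin)
    }
  }
}
```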
