
I would normally load data from Cassandra into Apache Spark this way using Java:

SparkContext sparkContext = StorakleSparkConfig.getSparkContext();

CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
sqlContext.setKeyspace("midatabase");

DataFrame customers = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " +
            "WHERE CAST(store_id as string) = '" + storeId + "'");

But imagine I have a sharder and need to load several partition keys into this DataFrame. I could use WHERE IN (...) in my query and again use the cassandraSql method, but I am reluctant to use WHERE IN because of the well-known problem of creating a single point of failure at the coordinator node. This is explained here:

https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/

Is there a way to use several queries but load them into a single DataFrame?

Erick Ramirez
Milen Kovachev

1 Answer


One way to do this is to run the individual queries and combine the resulting DataFrames/RDDs with unionAll/union.

SparkContext sparkContext = StorakleSparkConfig.getSparkContext();

CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
sqlContext.setKeyspace("midatabase");

DataFrame customersOne = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " +
        "WHERE CAST(store_id as string) = '" + storeId1 + "'");

DataFrame customersTwo = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " +
        "WHERE CAST(store_id as string) = '" + storeId2 + "'");

DataFrame allCustomers = customersOne.unionAll(customersTwo);
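If the number of partition keys varies at runtime, the same pattern can be written as a loop over the keys, unioning each per-key DataFrame into an accumulator. This is a sketch, not tested against a cluster; the storeIds list and the query are illustrative, and sqlContext is the CassandraSQLContext from the snippet above:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative list of shard keys; in practice this would come from the sharder.
List<String> storeIds = Arrays.asList(storeId1, storeId2, storeId3);

// Run one single-partition query per key and fold the results together.
// unionAll only concatenates the partitions of the inputs, so no shuffle occurs.
DataFrame allCustomers = null;
for (String storeId : storeIds) {
    DataFrame df = sqlContext.cassandraSql(
            "SELECT email, first_name, last_name FROM store_customer " +
            "WHERE CAST(store_id as string) = '" + storeId + "'");
    allCustomers = (allCustomers == null) ? df : allCustomers.unionAll(df);
}
```

Note that the union is lazy: each cassandraSql call only builds a query plan, and Cassandra is not hit until an action (e.g. count or collect) runs on allCustomers.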
Alex Naspo
    Thanks for the answer! Yes I thought of this but was not sure about performance implications on the Spark side. Do you think there are any? – Milen Kovachev Jan 22 '16 at 11:56
  • @MilenKovachev Union is very efficient as it does not require any shuffle. However, be aware that it could potentially remove your partitioner. See here: http://stackoverflow.com/questions/29977526/in-apache-spark-why-does-rdd-union-does-not-preserve-partitioner – Alex Naspo Jan 22 '16 at 15:23
  • Assuming I have a variable number of keys that I need to retrieve, I will have to run the queries in a for loop. Is there a way to run the individual sqlContext.cassandraSql statements in parallel? – Milen Kovachev Jan 25 '16 at 10:18