
I would normally load data from Cassandra into Apache Spark this way using Java:

SparkContext sparkContext = StorakleSparkConfig.getSparkContext();

CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
sqlContext.setKeyspace("midatabase");

DataFrame customers = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " +
            "WHERE CAST(store_id as string) = '" + storeId + "'");

But imagine I have a sharder and need to load several partition keys into this DataFrame. I could use WHERE IN (...) in my query and again use the cassandraSql method, but I am reluctant to use WHERE IN because of the well-known problem of creating a single point of failure at the coordinator node. This is explained here:

https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/

Is there a way to use several queries but load them into a single DataFrame?

Erick Ramirez
Milen Kovachev

1 Answer


One way to do this is to run the individual queries and combine the resulting DataFrames/RDDs with unionAll/union.

SparkContext sparkContext = StorakleSparkConfig.getSparkContext();

CassandraSQLContext sqlContext = new CassandraSQLContext(sparkContext);
sqlContext.setKeyspace("midatabase");

DataFrame customersOne = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " +
        "WHERE CAST(store_id as string) = '" + storeId1 + "'");

DataFrame customersTwo = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM store_customer " +
        "WHERE CAST(store_id as string) = '" + storeId2 + "'");

DataFrame allCustomers = customersOne.unionAll(customersTwo);
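If the number of partition keys varies at runtime, the same pattern can be written as a loop over the keys, unioning each per-key DataFrame into an accumulator. This is a sketch, not tested against a cluster; the storeIds list and the query are illustrative, and sqlContext is the CassandraSQLContext from the snippet above:

```java
import java.util.Arrays;
import java.util.List;

// Illustrative list of shard keys; in practice this would come from the sharder.
List<String> storeIds = Arrays.asList(storeId1, storeId2, storeId3);

// Run one single-partition query per key and fold the results together.
// unionAll only concatenates the partitions of the inputs, so no shuffle occurs.
DataFrame allCustomers = null;
for (String storeId : storeIds) {
    DataFrame df = sqlContext.cassandraSql(
            "SELECT email, first_name, last_name FROM store_customer " +
            "WHERE CAST(store_id as string) = '" + storeId + "'");
    allCustomers = (allCustomers == null) ? df : allCustomers.unionAll(df);
}
```

Note that the union is lazy: each cassandraSql call only builds a query plan, and Cassandra is not hit until an action (e.g. count or collect) runs on allCustomers.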
Alex Naspo
    Thanks for the answer! Yes I thought of this but was not sure about performance implications on the Spark side. Do you think there are any? – Milen Kovachev Jan 22 '16 at 11:56
  • @MilenKovachev Union is very efficient as it does not require any shuffle. However, be aware that it could potentially remove your partitioner. See here: http://stackoverflow.com/questions/29977526/in-apache-spark-why-does-rdd-union-does-not-preserve-partitioner – Alex Naspo Jan 22 '16 at 15:23
  • Assuming I have a variable number of keys that I need to retrieve, I will have to run the queries in a for loop. Is there a way to run the individual sqlContext.cassandraSql statements in parallel? – Milen Kovachev Jan 25 '16 at 10:18