I'm building a Spark Structured Streaming application that performs calculations on data received from Kafka every 10 seconds (a rough sketch of the stream setup is below).
For some of these calculations, I need to look up information about sensors and their placement in a Cassandra database.
I'm stuck on how to keep the Cassandra data available throughout the cluster, and how to refresh it from time to time in case we have made changes to the database table.
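For context, the streaming side is set up roughly like this (the bootstrap servers, topic name, and payload parsing are simplified placeholders, not my exact code):

val sensorStateDf = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "sensor-state")                 // placeholder topic
  .load()
  // the Kafka value is parsed into columns such as plantKey
  .selectExpr("CAST(value AS STRING) AS json")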
Currently, I query the database as soon as I start Spark (locally, for now) using the DataStax Spark Cassandra Connector:
import org.apache.spark.sql.cassandra._

val cassandraSensorDf = spark
  .read
  .cassandraFormat("specifications", "sensors")
  .load()
From here on I can use this cassandraSensorDf by joining it with my Structured Streaming Dataset:
val joinedDf = sensorStateDf
  .join(
    cassandraSensorDf,
    sensorStateDf("plantKey") <=> cassandraSensorDf("cassandraPlantKey")
  )
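The joined stream is then started with a 10-second trigger, roughly like this (the console sink is just for illustration):

import org.apache.spark.sql.streaming.Trigger

val query = joinedDf
  .writeStream
  .format("console") // illustrative sink
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()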
How do I run additional queries to refresh this Cassandra data while the Structured Streaming job is running? And how can I make the queried data available to all executors in a cluster setting?
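The only idea I've come up with so far is to cache the lookup DataFrame and re-read it on a schedule, something like the sketch below, but I doubt that reassigning a var is ever picked up by the already-running streaming query:

// Idea only, not verified: periodically unpersist and re-read the lookup table.
var cassandraSensorDf = spark
  .read
  .cassandraFormat("specifications", "sensors")
  .load()
cassandraSensorDf.cache()

// ... later, on some schedule, from a separate thread:
cassandraSensorDf.unpersist()
cassandraSensorDf = spark
  .read
  .cassandraFormat("specifications", "sensors")
  .load()
cassandraSensorDf.cache()
// Does the running streaming join ever see this re-read data?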