How to do JOIN on streamed data from kafka on spark streaming

Question

I am new to Spark Streaming. I am trying some exercises on fetching data from Kafka and joining it with a Hive table. I am not sure how to do a JOIN in Spark Streaming (not Structured Streaming). Here is my code:

   val ssc = new StreamingContext("local[*]", "KafkaExample", Seconds(1))   

   val kafkaParams = Map[String, Object](
   "bootstrap.servers" -> "dofff2.dl.uk.feefr.com:8002",
   "security.protocol" -> "SASL_PLAINTEXT",
   "key.deserializer" -> classOf[StringDeserializer],
   "value.deserializer" -> classOf[StringDeserializer],
   "group.id" -> "1",
   "auto.offset.reset" -> "latest",
   "enable.auto.commit" -> (false: java.lang.Boolean)
   )

   val topics = Array("csvstream")
   val stream = KafkaUtils.createDirectStream[String, String](
   ssc,
   PreferConsistent,
   Subscribe[String, String](topics, kafkaParams)
   )

   val strmk = stream.map(record => (record.value,record.timestamp))

Now I want to join with one of the tables in Hive. In Spark Structured Streaming I can directly call spark.table("table name") and do a join, but how can I do it in Spark Streaming, since everything there is based on RDDs? Can someone help me?

I need one more bit of help: how can I add the timestamp from Kafka when I split my value? — BigD, Feb 06 '19 at 21:01
val rdd1 = strmk.map(line => line.split(',')).map(s => (s(0).toString, s(1).toString,s(2).toString,s(3).toString,s(4).toString, s(5).toString,s(6).toString,s(7).toString)) — BigD, Feb 06 '19 at 21:01
With kafka there are certain values you can get. Best to give input and outputs expected with the question. — thebluephantom, Feb 06 '19 at 21:19
val strmk = stream.map(record => (record.value,record.timestamp)) Here I am getting the timestamp from Kafka. My question is: how can I carry the same timestamp along when splitting? — BigD, Feb 06 '19 at 21:57
@thebluephantom I added a question. Can you help? https://stackoverflow.com/questions/54563312/how-to-add-timestamp-from-kafka-to-spark-streaming-during-converting-to-df — BigD, Feb 06 '19 at 22:08
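
For the timestamp question in the comments above, one possible approach is to pattern-match on the (value, timestamp) pair and split only the value, so the timestamp travels alongside the fields. The following is a minimal sketch in plain Scala; the object and function names are hypothetical and not taken from the thread, and it assumes the value is simple comma-separated text with no quoted commas.

```scala
object SplitWithTimestamp {

  // Splits a comma-separated value and keeps the Kafka record timestamp next to the fields.
  // The limit of -1 keeps trailing empty fields, which a plain split(",") would drop.
  def splitKeepingTimestamp(value: String, timestamp: Long): (Array[String], Long) =
    (value.split(",", -1), timestamp)

  def main(args: Array[String]): Unit = {
    // Hypothetical sample record: a CSV line plus an epoch-millisecond timestamp.
    val (fields, ts) = splitKeepingTimestamp("a,b,c", 1549486860000L)
    println(s"${fields.mkString("|")} @ $ts")
  }
}
```

On the stream from the question this would presumably be applied as strmk.map { case (value, ts) => SplitWithTimestamp.splitKeepingTimestamp(value, ts) }, since each element of strmk is a tuple rather than a plain string.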

score 0 · Answer 1 · answered Feb 02 '19 at 11:07

You need transform.

Something like this is required:

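// join is only defined on pair RDDs, so both sides must be keyed by the join column
val dataset: RDD[(String, String)] = ... // From Hive
val windowedStream = stream.window(Seconds(20))... // From dStream, mapped to (key, value) pairs
val joinedStream = windowedStream.transform { rdd => rdd.join(dataset) }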

From the manuals:

The transform operation (along with its variations like transformWith) allows arbitrary RDD-to-RDD functions to be applied on a DStream. It can be used to apply any RDD operation that is not exposed in the DStream API. For example, the functionality of joining every batch in a data stream with another dataset is not directly exposed in the DStream API. However, you can easily use transform to do this. This enables very powerful possibilities.
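
To make the quoted passage concrete, here is a minimal local sketch of joining each batch of a DStream with a static pair RDD through transform. It is a sketch under assumptions, not the asker's actual setup: a queue-backed DStream stands in for the Kafka stream, a parallelized collection stands in for the Hive table, and the names TransformJoinSketch and joinOnce are hypothetical. In a real job the lookup side might instead be built from something like spark.table(...).rdd mapped to (key, value) pairs.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.JavaConverters._
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TransformJoinSketch {

  // Pushes one simulated micro-batch through transform + join and returns the joined rows.
  def joinOnce(batch: Seq[(String, String)],
               lookupRows: Seq[(String, String)]): Seq[(String, (String, String))] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("TransformJoinSketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    val sc = ssc.sparkContext

    // Stand-in for the Hive table: a static pair RDD, cached because every batch reuses it.
    val lookup: RDD[(String, String)] = sc.parallelize(lookupRows).cache()

    // Stand-in for the Kafka stream: a queue-backed DStream that emits the batch once.
    val events = ssc.queueStream(mutable.Queue(sc.parallelize(batch)))

    // transform exposes each batch as an RDD, so the ordinary pair-RDD join applies.
    val joined = events.transform(rdd => rdd.join(lookup))

    // The foreachRDD body runs on the driver, so results can go into a driver-side queue.
    val out = new ConcurrentLinkedQueue[(String, (String, String))]()
    joined.foreachRDD(rdd => rdd.collect().foreach(row => out.add(row)))

    ssc.start()
    ssc.awaitTerminationOrTimeout(3000) // let a few one-second batches run
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    out.asScala.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data: only key "1" exists on both sides of the inner join.
    val rows = joinOnce(
      batch = Seq("1" -> "click", "3" -> "view"),
      lookupRows = Seq("1" -> "alice", "2" -> "bob"))
    rows.foreach(println)
  }
}
```

Because join is an inner join, stream records whose key is missing from the lookup side are dropped; leftOuterJoin on the pair RDD would keep them if that is what the use case needs.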

An example of this can be found here: How to join a DStream with a non-stream file?

The following guide helps: https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
