
I understand that joins of two different streaming DataFrames are not supported in Spark 2.2.0, but I am trying to do a self-join, so there is only one stream. Below is my code:

val jdf = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "join_test")
    .option("startingOffsets", "earliest")
    .load();

jdf.printSchema

which prints the following:

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)

Now I run the self-join query below, after reading through this SO post:

jdf.as("jdf1").join(jdf.as("jdf2"), $"jdf1.key" === $"jdf2.key")

And I get the following exception:

org.apache.spark.sql.AnalysisException: cannot resolve '`jdf1.key`' given input columns: [timestamp, value, partition, timestampType, topic, offset, key];;
'Join Inner, ('jdf1.key = 'jdf2.key)
:- SubqueryAlias jdf1
:  +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@f662b5,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#243, value#244, topic#245, partition#246, offset#247L, timestamp#248, timestampType#249]
+- SubqueryAlias jdf2
   +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@f662b5,kafka,List(),None,List(),None,Map(startingOffsets -> earliest, subscribe -> join_test, kafka.bootstrap.servers -> localhost:9092),None), kafka, [key#243, value#244, topic#245, partition#246, offset#247L, timestamp#248, timestampType#249]
user1870400

1 Answer


I think it makes no difference whether you try to join the same streaming DataFrame with itself or two different streaming DataFrames: either way it is a stream-stream join, so it will not be supported in Spark 2.2.0.

There are two ways to achieve it.

First, you can join a static DataFrame with a streaming DataFrame: read the topic once as batch data and again as a streaming DataFrame, then join the two. Second, you can use Kafka Streams, which supports joining streaming data.
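The first approach can be sketched as below. This is a minimal illustration assuming the same `join_test` topic and localhost broker from the question; the column renames (`static_key`, `static_value`) are illustrative names I introduced, not part of the original post:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("stream-static-self-join")
  .master("local[*]")
  .getOrCreate()

// Read the topic once as a static (batch) DataFrame, renaming its columns
// so the join condition is unambiguous.
val staticDf = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "join_test")
  .option("startingOffsets", "earliest")
  .load()
  .selectExpr("key AS static_key", "value AS static_value")

// Read the same topic again as a streaming DataFrame.
val streamDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "join_test")
  .option("startingOffsets", "earliest")
  .load()

// A stream-static join is supported in Spark 2.2.x, unlike a
// stream-stream (self-)join.
val joined = streamDf.join(staticDf, streamDf("key") === staticDf("static_key"))
```

Note that the batch read is a one-time snapshot of the topic at query start, so records arriving later will only join against that snapshot; if you need stream-to-stream semantics, this workaround is only an approximation.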

Mahesh Chand
  • You may be right, but the error says something else: it says cannot resolve '`jdf1.key`'; it doesn't say joins are not supported! – user1870400 Feb 18 '18 at 13:55
  • It might help you, https://stackoverflow.com/questions/46412884/workaround-for-joining-two-streams-in-structured-streaming-in-spark-2-x – Mahesh Chand Feb 18 '18 at 14:11
  • Got it! But I would expect an UnsupportedOperationException... instead I am seeing something else with the above code. – user1870400 Feb 18 '18 at 14:14