I am using Spark Structured Streaming to process data from Kafka, transforming each message into a JSON string. However, Spark needs an explicit schema to extract columns from the JSON. Spark Streaming with DStreams allows doing the following:

spark.read.json(spark.createDataset(jsons))

where jsons is an RDD[String].
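For context, a minimal sketch of how that pattern fits into a DStream job; the names spark and messages are placeholders (a SparkSession and a DStream[String] carrying the JSON payloads), not my actual code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

def inferPerBatch(spark: SparkSession, messages: DStream[String]): Unit = {
  import spark.implicits._
  messages.foreachRDD { jsons =>
    // The schema is inferred from the JSON strings of each micro-batch.
    val df = spark.read.json(spark.createDataset(jsons))
    df.printSchema()
  }
}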
In the case of Spark Structured Streaming, the analogous call

df.sparkSession.read.json(jsons)

(where jsons is a Dataset[String]) results in the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
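For reference, a minimal sketch of the Structured Streaming setup that reproduces this; the broker address and topic name are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-json").getOrCreate()
import spark.implicits._

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "events")                       // placeholder topic
  .load()

// Each Kafka value holds a JSON string.
val jsons = df.selectExpr("CAST(value AS STRING)").as[String]

// This is the line that throws: read.json runs a batch query over a
// streaming Dataset, hence the "writeStream.start()" AnalysisException.
val parsed = df.sparkSession.read.json(jsons)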
I assume that read triggers execution immediately instead of deferring it to writeStream.start(), but is there a way to work around this?