Getting error saying "Queries with streaming sources must be executed with writeStream.start()" on spark structured streaming

Question

I am running into an issue while executing Spark SQL on top of Spark Structured Streaming. The error is shown below.

Here is my code:

object sparkSqlIntegration {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .appName("StructuredStreaming")
      .master("local[*]")
      // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
      .config("spark.sql.warehouse.dir", "file:///C:/temp")
      .config("spark.sql.streaming.checkpointLocation", "file:///C:/checkpoint")
      .getOrCreate()

    setupLogging()
    val userSchema = new StructType().add("name", "string").add("age", "integer")
    // Create a stream of text files dumped into the logs directory
    val rawData = spark.readStream
      .option("sep", ",")
      .schema(userSchema)
      .csv("file:///C:/Users/R/Documents/spark-poc-centri/csvFolder")

    // Must import spark.implicits for conversion to DataSet to work!
    import spark.implicits._
    rawData.createOrReplaceTempView("updates")
    val sqlResult= spark.sql("select * from updates")
    println("sql results here")
    sqlResult.show()
    println("Otheres")
    val query = rawData.writeStream.outputMode("append").format("console").start()

    // Keep going until we're stopped.
    query.awaitTermination()

    spark.stop()
  }
}

During execution, I get the following error. As I am new to streaming, can anyone tell me how to execute Spark SQL queries on Spark Structured Streaming?

2018-12-27 16:02:40 INFO  BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, LAPTOP-5IHPFLOD, 6829, None)
2018-12-27 16:02:41 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6731787b{/metrics/json,null,AVAILABLE,@Spark}
sql results here
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[file:///C:/Users/R/Documents/spark-poc-centri/csvFolder]
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:37)
    at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
    at scala.collection.immutable.List.foreach(List.scala:392)

OneCricketeer · Answer 1 · 2018-12-28T22:10:24.543

You don't need any of these lines:

import spark.implicits._
rawData.createOrReplaceTempView("updates")
val sqlResult= spark.sql("select * from updates")
println("sql results here")
sqlResult.show()
println("Otheres")

Most importantly, select * isn't needed: when you print the DataFrame, you already see all of its columns. For the same reason, you don't need to register a temp view just to give it a name.

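And format("console") already prints each batch, so .show() is not needed. The .show() call is in fact what throws the exception: it is a batch action, and a streaming DataFrame can only be executed through writeStream.start().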
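
As a rough sketch, the question's program with those lines removed could look like this. The input path /tmp/csvFolder is a placeholder (not the asker's path), and the select/where step is only an illustrative transformation; the schema is the one from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object StreamToConsole {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("StructuredStreaming")
      .master("local[*]")
      .getOrCreate()

    val userSchema = new StructType().add("name", "string").add("age", "integer")

    // Placeholder input directory (assumption); point it at your own CSV folder.
    val rawData = spark.readStream
      .option("sep", ",")
      .schema(userSchema)
      .csv("/tmp/csvFolder")

    // Reports whether the DataFrame is backed by a streaming source.
    // Batch actions such as show() are rejected on streaming DataFrames.
    println(s"isStreaming = ${rawData.isStreaming}")

    // Illustrative transformation (assumption); it is lazy and stays streaming.
    val adults = rawData.select("name", "age").where("age >= 18")

    // The only way to execute a streaming query: writeStream ... start()
    val query = adults.writeStream
      .outputMode("append")
      .format("console")
      .start()

    // Keep going until we're stopped.
    query.awaitTermination()
  }
}
```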

Refer to the Spark examples for reading from a network socket and writing the output to the console:

val words = // omitted ... some Streaming DataFrame

// Generating a running word count
val wordCounts = words.groupBy("value").count()

// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()

Takeaway: use DataFrame operations like .select() and .groupBy() rather than raw SQL.
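
That said, SQL is not off limits in Structured Streaming. To the best of my knowledge, a streaming DataFrame can be registered as a temp view and queried with spark.sql, which returns another streaming DataFrame; that result just has to be started with writeStream instead of printed with show(). A minimal sketch, assuming rawData is the streaming DataFrame from the question (the object and method names, and the age filter, are made up for illustration):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.StreamingQuery

object StreamingSql {
  // rawData is assumed to be a streaming DataFrame with "name" and "age" columns.
  def startSqlQuery(spark: SparkSession, rawData: DataFrame): StreamingQuery = {
    // Registering the view does not execute anything.
    rawData.createOrReplaceTempView("updates")

    // Querying a streaming view yields another streaming DataFrame.
    val sqlResult = spark.sql("select name, age from updates where age >= 18")

    // Execute through writeStream.start() rather than show().
    sqlResult.writeStream
      .outputMode("append")
      .format("console")
      .start()
  }
}
```

Calling StreamingSql.startSqlQuery(spark, rawData).awaitTermination() would then keep the job running, in place of the show() call.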

Alternatively, you can use Spark Streaming (DStreams). As shown in those examples, you call foreachRDD on the stream, convert each batch's RDD to a DataFrame, and then query it with SQL:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

/** Case class for converting RDD to DataFrame */
case class Record(word: String)

val words = // omitted ... some DStream

// Convert RDDs of the words DStream to DataFrame and run SQL query
words.foreachRDD { (rdd: RDD[String], time: Time) =>
  // Get the singleton instance of SparkSession
  val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
  import spark.implicits._

  // Convert RDD[String] to RDD[case class] to DataFrame
  val wordsDataFrame = rdd.map(w => Record(w)).toDF()

  // Creates a temporary view using the DataFrame
  wordsDataFrame.createOrReplaceTempView("words")

  // Do word count on table using SQL and print it
  val wordCountsDataFrame =
    spark.sql("select word, count(*) as total from words group by word")
  println(s"========= $time =========")
  wordCountsDataFrame.show()
}

ssc.start()
ssc.awaitTermination()
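
The snippet above relies on a SparkSessionSingleton helper that it does not define. In the Spark examples it is, as far as I recall, a small lazily initialized holder along these lines; treat this as a sketch rather than the exact upstream source:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

/** Lazily instantiated singleton instance of SparkSession */
object SparkSessionSingleton {

  @transient private var instance: SparkSession = _

  def getInstance(sparkConf: SparkConf): SparkSession = {
    // Create the session on first use, then reuse it for every batch.
    if (instance == null) {
      instance = SparkSession
        .builder
        .config(sparkConf)
        .getOrCreate()
    }
    instance
  }
}
```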
