
I am working with Spark Structured Streaming and I am facing some issues trying to implement multiple writeStreams. Below is my code:

DataWriter.writeStreamer(firstTableData, "parquet", CheckPointConf.firstCheckPoint, OutputConf.firstDataOutput)
DataWriter.writeStreamer(secondTableData, "parquet", CheckPointConf.secondCheckPoint, OutputConf.secondDataOutput)
DataWriter.writeStreamer(thirdTableData, "parquet", CheckPointConf.thirdCheckPoint, OutputConf.thirdDataOutput)

where writeStreamer is defined as follows:

def writeStreamer(input: DataFrame, format: String, checkPointFolder: String, output: String): Unit = {

  val query = input
                .writeStream
                .format(format)
                .option("checkpointLocation", checkPointFolder)
                .option("path", output)
                .outputMode(OutputMode.Append)
                .start()

  query.awaitTermination()
}

The problem I am facing is that only the first table is written with Spark writeStream; nothing happens for the other tables. Do you have any idea about this, please?

scalacode

3 Answers


query.awaitTermination() should be done after the last stream is created.

The writeStreamer function can be modified to return a StreamingQuery and not call awaitTermination at that point (since it is blocking):

def writeStreamer(input: DataFrame, format: String, checkPointFolder: String, output: String): StreamingQuery = {
  input
    .writeStream
    .format(format)
    .option("checkpointLocation", checkPointFolder)
    .option("path", output)
    .outputMode(OutputMode.Append)
    .start()
}

then you will have:

val query1 = DataWriter.writeStreamer(...)
val query2 = DataWriter.writeStreamer(...)
val query3 = DataWriter.writeStreamer(...)

query3.awaitTermination()
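
One caveat (a note, not part of this answer): query3.awaitTermination() only blocks on the third query, so a failure in query1 or query2 will go unnoticed by the driver. A small variation is to block on each query in turn once all three have been started; spark.streams.awaitAnyTermination() (see the next answer) is another option.

// All three queries are already running at this point; awaiting them
// one after another keeps the driver alive until every query has terminated
// and surfaces a StreamingQueryException if one of them fails.
query1.awaitTermination()
query2.awaitTermination()
query3.awaitTermination()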
bp2010

If you want the writers to run in parallel, you can use

sparkSession.streams.awaitAnyTermination()

and remove query.awaitTermination() from the writeStreamer method.
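
A minimal sketch of that approach, assuming writeStreamer has been changed to return the started StreamingQuery (as in the answer above) and that sparkSession is the active SparkSession:

// Start all queries without blocking inside writeStreamer.
DataWriter.writeStreamer(firstTableData, "parquet", CheckPointConf.firstCheckPoint, OutputConf.firstDataOutput)
DataWriter.writeStreamer(secondTableData, "parquet", CheckPointConf.secondCheckPoint, OutputConf.secondDataOutput)
DataWriter.writeStreamer(thirdTableData, "parquet", CheckPointConf.thirdCheckPoint, OutputConf.thirdDataOutput)

// Block until any of the active streaming queries terminates (or fails).
sparkSession.streams.awaitAnyTermination()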

Shreyansh

By default the number of concurrent jobs is 1, which means only one job will be active at a time.

Did you try increasing the number of concurrent jobs in the Spark conf?

sparkConf.set("spark.streaming.concurrentJobs","3")

Not an official source: http://why-not-learn-something.blogspot.com/2016/06/spark-streaming-performance-tuning-on.html
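
For reference, a minimal sketch of where such a property would be set when building the session (the app name is made up; the property itself is the undocumented one described in the linked post):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkConf = new SparkConf()
  .setAppName("multi-stream-writer") // hypothetical app name
  .set("spark.streaming.concurrentJobs", "3")

val sparkSession = SparkSession.builder()
  .config(sparkConf)
  .getOrCreate()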

maxime G
  • I removed the awaitTermination from the writer, and now I am getting a new error: ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter: Aborting job null. org.apache.spark.SparkException: Job 1 cancelled because SparkContext was shut down – scalacode Jul 18 '18 at 12:57