I'm reading data from Kafka using spark streaming and passing to py file for prediction. It returns predictions as well as the original data. It's saving the original data with its predictions to file however it is creating a single file for each RDD. I need a single file consisting of all the data collected till the I stop the program to be saved to a single file.
I have tried writeStream it does not create even a single file. I have tried to save it to parquet using append but it creates multiple files that is 1 for each RDD. I tried to write with append mode still multiple files as output. The below code creates a folder output.csv and enters all the files into it.
def main(args: Array[String]): Unit = {
val ss = SparkSession.builder()
.appName("consumer")
.master("local[*]")
.getOrCreate()
val scc = new StreamingContext(ss.sparkContext, Seconds(2))
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer"->
"org.apache.kafka.common.serialization.StringDeserializer",
"value.deserializer">
"org.apache.kafka.common.serialization.StringDeserializer",
"group.id"-> "group5" // clients can take
)
mappedData.foreachRDD(
x =>
x.map(y =>
ss.sparkContext.makeRDD(List(y)).pipe(pyPath).toDF().repartition(1)
.write.format("csv").mode("append").option("truncate","false")
.save("output.csv")
)
)
scc.start()
scc.awaitTermination()
I need to get just 1 file with all the statements one by one collected while streaming.
Any help will be appreciated, thank you in anticipation.