I have a streaming query that reads from Kafka as its source. I want to run some logic on each batch that arrives from the stream. Here's what I have so far:
import org.apache.spark.sql.{ForeachWriter, Row}

val streamDF = spark
  .readStream
  ...
  .load()

// val bc = spark.sparkContext.broadcast(spark)

streamDF
  .writeStream
  .foreach(new ForeachWriter[Row] {
    override def open(partitionId: Long, version: Long): Boolean = true

    override def process(record: Row): Unit = {
      // build a small DataFrame from scratch inside the writer
      val aRDD = spark.sparkContext.parallelize(Seq("a", "b", "c"))
      val aDF = spark.createDataFrame(aRDD.map(Tuple1(_)))
      // val aDF = bc.value.createDataFrame(aRDD.map(Tuple1(_)))
      // do something with aDF
    }

    override def close(errorOrNull: Throwable): Unit = {}
  })
  .start()
I'm using Spark 2.3.2, so I'm stuck with ForeachWriter (foreachBatch only arrived in Spark 2.4; it would have made my life much simpler). I'm also aware that foreach() runs on the executors, so, keeping that in mind, I tried broadcasting the SparkSession to all the executors. That did not help either; that attempt is the commented-out part of the code snippet above.
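For reference, this is the Spark 2.4+ pattern I wish I could use: foreachBatch hands each micro-batch to a function on the driver, where the SparkSession is usable. This is just a sketch of what I mean, not something I can run on 2.3.2:

streamDF
  .writeStream
  .foreachBatch { (batchDF: org.apache.spark.sql.DataFrame, batchId: Long) =>
    // batchDF is an ordinary DataFrame on the driver, so joins and actions work here
    batchDF.persist()
    // ... heavy DataFrame/Dataset logic, including actions ...
    batchDF.unpersist()
  }
  .start()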
What I'm looking for is a way to process the data as a DataFrame inside foreach on Spark 2.3.2. I have to use DataFrames/Datasets, since the operations are pretty heavy, and they include actions as well.
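To make "pretty heavy" concrete, the per-batch logic I need is roughly of this shape (batchDF, lookupDF, "key" and the output path are all placeholders, not my real code):

// hypothetical example of the kind of work each batch needs
val enriched = batchDF.join(lookupDF, Seq("key"))  // join against a reference table
val summary = enriched.groupBy("key").count()
if (summary.head(1).nonEmpty) {                    // an action, which needs the driver-side SparkSession
  summary.write.mode("append").parquet("/some/output/path")
}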
I found a similar question, but there is no response on it --> similar q