
In the following code, the functions fn1 and fn2 appear to be applied to inRDD sequentially, as I can see in the Stages section of the Spark Web UI.

 DstreamRDD1.foreachRDD(new VoidFunction<JavaRDD<String>>()
 {
     @Override
     public void call(JavaRDD<String> inRDD)
        {
          inRDD.foreach(fn1);
          inRDD.foreach(fn2);
        }
 });

How is it different when the streaming job is run this way? Are the functions below run in parallel on the input DStreams?

DStreamRDD1.foreachRDD(fn1)
DStreamRDD2.foreachRDD(fn2)
darkknight444

1 Answer


Both foreach on RDD and foreachRDD on DStream will run sequentially because they are output operations, meaning they cause the materialization of the graph. This would not be the case for any general lazy transformation in Spark, which can run in parallel when the execution graph diverges into multiple separate stages.

For example:

val dStream: DStream[String] = ???
val first = dStream.filter(x => x.contains("h"))
val second = dStream.filter(x => !x.contains("h"))

first.print()
second.print()

The two filter transformations need not execute sequentially when you have sufficient cluster resources to run the underlying stages in parallel. Then, calling print, which again is an output operation, will cause the print statements to be emitted one after the other.
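If you genuinely need the two foreach actions to run concurrently within one batch, one approach (my suggestion, not something the question's code does) is to submit them as separate Spark jobs from separate threads, for example with Scala Futures, ideally together with spark.scheduler.mode=FAIR so the jobs share executors. A minimal sketch, where dStreamRDD1, fn1, and fn2 stand in for the names from the question:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

dStreamRDD1.foreachRDD { rdd =>
  // Each Future submits its own Spark job, so the scheduler may run the
  // two foreach actions concurrently (subject to available executor slots).
  val job1 = Future { rdd.foreach(fn1) }
  val job2 = Future { rdd.foreach(fn2) }
  // Block until both jobs finish before the batch is considered complete.
  Await.result(Future.sequence(Seq(job1, job2)), Duration.Inf)
}
```

Note that this only overlaps the scheduling of the two jobs; whether they actually execute in parallel still depends on cluster resources.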

Yuval Itzchakov