In the following code, is the BLOCK 2 loop guaranteed to execute only after all the executor tasks spawned by BLOCK 1 have finished, or can it run while some of those tasks are still in flight? If the two blocks can run concurrently, what is the best way to prevent this? I need to process the contents of the accumulator, but only once all the executors have finished.

When running with a master URL of local[4] as shown, BLOCK 2 appears to wait for BLOCK 1 to finish; however, when running with a master URL of yarn I see errors which suggest that BLOCK 2 is running concurrently with the executor tasks in BLOCK 1.
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

object Main {

  // MyRecord, buildRecord() and processIt() are defined elsewhere in my project.
  def main(args: Array[String]): Unit = {
    val sparkContext = SparkSession.builder.master("local[4]").getOrCreate().sparkContext
    val myRecordAccumulator = sparkContext.collectionAccumulator[MyRecord]

    // BLOCK 1
    sparkContext.binaryFiles("/my/files/here").foreach(_ => {
      for (i <- 1 to 100000) {
        val record = buildRecord()
        myRecordAccumulator.add(record)
      }
    })

    // BLOCK 2
    myRecordAccumulator.value.asScala.foreach((record: MyRecord) => {
      processIt(record)
    })
  }
}
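
In case the accumulator approach itself is the problem, one workaround I have been considering is to build the records in a transformation and collect() them on the driver, so they only become available after the whole job has completed. This is just a rough, untested sketch; buildRecord() and processIt() are the same placeholder helpers as above, and it assumes MyRecord is serializable:

    // Possible alternative (untested on yarn): produce the records in a flatMap
    // and collect() them, so processing on the driver can only start once the
    // job has finished. Assumes MyRecord is serializable.
    val records = sparkContext.binaryFiles("/my/files/here")
      .flatMap(_ => (1 to 100000).map(_ => buildRecord()))
      .collect()

    records.foreach(processIt)

Is something like this necessary, or is the original accumulator approach already guaranteed to be safe?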