In the following scala code the desired string is not printed

Question

Why does the accumulator variable in the following code does not print the aggregated string?


object mapRDD {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .master("local")
      .appName("sparkSessionName")
      .getOrCreate()

    spark.sparkContext.setLogLevel("WARN")

    val data = Seq("Project",
      "Gutenberg’s",
      "Alice’s",
      "Adventures",
      "in",
      "Wonderland")

    val rdd = spark.sparkContext.parallelize(data)

    var accumulator: String = "WHY IS THE AGGREGATED STRING NOT PRINTED?"
    for (eachElementOfRDD <- rdd) {
      accumulator = accumulator ++ eachElementOfRDD
    }
    println(accumulator)
  }
}

Output 21/05/13 09:10:52 INFO BlockManagerMasterEndpoint: Registering block manager host.docker.internal:64786 with 894.3 MB RAM, BlockManagerId(driver, host.docker.internal, 64786, None) 21/05/13 09:10:52 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, host.docker.internal, 64786, None) 21/05/13 09:10:52 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, host.docker.internal, 64786, None) WHY IS THE AGGREGATED STRING NOT PRINTED?

I am a newbie to spark and scala and I understand the output of the following code is correct. What I want to know is the reason for such a behaviour. What this concept is called and some pointers to understand it.

EDIT The variable name accumulator was a co-incidence to the accumulator functionality of spark. I am concerned with adding a string to the original string until the loop is completed.

score 1 · Answer 1 · answered May 13 '21 at 05:33

1

Two things should be changed in the code to get the expected output:

create an accumulator using SparkContext.collectionAccumulator
iterate over the elements of the rdd using RDD.foreach

val accumulator = spark.sparkContext.collectionAccumulator[String]
rdd.foreach( eachElementOfRDD => accumulator.add(eachElementOfRDD))
println(accumulator.value)

Output:

[Project, Gutenberg’s, Alice’s, Adventures, in, Wonderland]

answered May 13 '21 at 05:33

werner

13,518
6
30
45

My concern is that the code using for (eachElementOfRDD <- rdd) { accumulator = accumulator ++ eachElementOfRDD } the eachElementOfRDD gets added to accumulator variable But outside the loop its value is unchanged. – theAccidentalDeveloper May 13 '21 at 10:20
@mannat please have a look at [this answer](https://stackoverflow.com/a/67287430/2129801). Does this help? – werner May 13 '21 at 10:47

In the following scala code the desired string is not printed

1 Answers1