
Using Spark, I filter and transform a collection, and then I want to count the result collection and save it to a file. If the result collection does not fit in memory, does this mean the output will be computed twice? Is there a way to count and saveAsObjectFile at the same time, so that it is not computed twice?

val input: RDD[Page] = ...
val output: RDD[Result] = input.filter(...).map(...)  // expensive computation
output.cache()                         // mark for caching so a second action can reuse the result
val count = output.count()             // first action: triggers the computation
output.saveAsObjectFile("file.out")    // second action: reuses the cache, if it fit in memory
  • Possible duplicate of [Spark: how to get the number of written rows?](http://stackoverflow.com/questions/37496650/spark-how-to-get-the-number-of-written-rows)

1 Answer


Solution #1: cache to memory and disk

You can persist the RDD to both memory and disk. This avoids computing it twice, but any partitions that spilled to disk must be read back from disk (instead of RAM) by the second action.

To do this, call persist() with StorageLevel.MEMORY_AND_DISK as the parameter. Spark keeps the computed partitions in memory and spills whatever does not fit to disk.

http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose

MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
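
As a minimal sketch, reusing the question's own pipeline (input, the Page/Result types, and the filter/map placeholders come from the question):

import org.apache.spark.storage.StorageLevel

val output: RDD[Result] = input.filter(...).map(...)  // expensive computation
output.persist(StorageLevel.MEMORY_AND_DISK)  // keep partitions in memory, spill the rest to disk
val count = output.count()                    // first action: computes and persists the RDD
output.saveAsObjectFile("file.out")           // second action: reads persisted data, no recompute
output.unpersist()                            // optionally free the storage once both actions are done

Note that cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), where partitions that don't fit in memory are dropped and recomputed on the next action; MEMORY_AND_DISK trades that recomputation for disk I/O.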

Solution #2: perform the count using an accumulator

A similar question was asked and answered here: http://thread.gmane.org/gmane.comp.lang.scala.spark.user/7920

The suggestion there is to use an accumulator that is incremented as each element passes through the map, just before saveAsObjectFile() writes it:

val countsAccum = sc.longAccumulator("count Accumulator")  // named, so it shows up in the Spark UI
output.map { x =>
  countsAccum.add(1)  // count each element as it flows past
  x
}.saveAsObjectFile("file.out")

After saveAsObjectFile() completes, the accumulator holds the total count, and you can print it (use .value to read the accumulator's value):

println(countsAccum.value)

If accumulators are created with a name, they are displayed in Spark's UI, which can be useful for understanding the progress of running stages. One caveat: since the update here happens inside a transformation (map) rather than an action, Spark may apply it more than once if a task is re-executed, so the count can overshoot on task retries.

More info can be found here: http://spark.apache.org/docs/latest/programming-guide.html#accumulators
