When using the Scala standard library, I can do something like this:
scala> val scalaList = List(1,2,3)
scalaList: List[Int] = List(1, 2, 3)
scala> scalaList.foldLeft(0)((acc,n)=>acc+n)
res0: Int = 6
Making one Int out of many Ints.
And I can do something like this:
scala> scalaList.foldLeft("")((acc,n)=>acc+n.toString)
res1: String = 123
Making one String out of many Ints.
So foldLeft can be either homogeneous or heterogeneous, whichever we want, all in one API.
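For reference, the signature of foldLeft on List[A] shows why: the accumulator type B is a separate type parameter from the element type A:
def foldLeft[B](z: B)(op: (B, A) => B): B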
While in Spark, if I want one Int out of many Ints, I can do this:
scala> val rdd = sc.parallelize(List(1,2,3))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:12
scala> rdd.fold(0)((acc,n)=>acc+n)
res1: Int = 6
The fold API is similar to foldLeft, but it is only homogeneous: an RDD[Int] can only produce an Int with fold.
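If I read the Scaladoc right, that limitation is visible in the signature of fold on RDD[T], where both the zero value and the result are fixed to the element type T:
def fold(zeroValue: T)(op: (T, T) => T): T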
There is an aggregate API in Spark too:
scala> rdd.aggregate("")((acc,n)=>acc+n.toString, (s1,s2)=>s1+s2)
res11: String = 132
It is heterogeneous: an RDD[Int] can produce a String now.
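Again going by the Scaladoc, aggregate on RDD[T] takes a separate result type U, plus a second function to merge partial results across partitions:
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U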
So, why are fold and aggregate implemented as two different APIs in Spark?
Why are they not designed like foldLeft that could be both homogeneous and heterogeneous?
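As far as I can tell, the homogeneous case is already expressible with aggregate by passing the same kind of function for both arguments, which is part of what makes me wonder why fold exists separately:
rdd.aggregate(0)((acc, n) => acc + n, (a, b) => a + b)  // should also give 6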
(I am very new to Spark, please excuse me if this is a silly question.)