3

foldByKey,aggregateByKey or combineByKey transformation in spark requires user to provide initialValue. I read some articles about it. In every article, it is said that initialValue should not affect the final result i.e use 0 in addition function, 1 in multiplication etc. Then what is the relevance of this value.

Take example of foldByKey,as per spark documentation

Merge the values for each key using an associative function and a neutral "zero value" which may be added to the result an arbitrary number of times, and must not change the result

Following is the code example of foldByKey using 0 as initial value:

rdd.foldByKey(0, (v1,v2)->v1+v2); where value type of RDD in Integer.

However, reduceByKey will also provide the same result.

Can somebody please shed a light on it that why is the initialValue parameter needed. Does it have any functional/performance benefits?

Sourav Gulati
  • 1,359
  • 9
  • 18
  • 2
    Take a look at the http://stackoverflow.com/q/34529953/1560062, add `ByKey` to the method name and you'll have your answer :) – zero323 Mar 20 '17 at 16:53
  • @zero323 Great explanation :) One more thing: both reduceByKey and foldByKey are using combineByKeyWithClassTag inside, so there should be no big performance difference – T. Gawęda Mar 20 '17 at 17:09
  • 1
    `combineByKey` is a the most general of them, and one is able to express `foldByKey` and `reduceByKey` with it. An initial value is useful for cases where you want a different type as the seed. – Yuval Itzchakov Mar 20 '17 at 17:41
  • @zero323: Your answer on the url you mentioned seems not to br applicable in case of transformation (i.e foldByKey not fold). Following is the reference code. `JavaRDD> emptyRDD= javaSparkContext.emptyRDD(); JavaPairRDD emptyPairRDD = JavaPairRDD.fromJavaRDD(emptyRDD ); JavaPairRDD foldByKey = emptyPairRDD.foldByKey(0,(v1,v2)->v1+v2); JavaPairRDD reduceByKey = emptyPairRDD.reduceByKey((v1,v2)->v1+v2);` . Both foldByKey and ReduceByKey outputs are same – Sourav Gulati Mar 21 '17 at 04:09
  • 1
    I understand it as there is no key, It should be like this. The "Mutable buffer" thing you mentioned for fold, will it hold true for foldByKey? even if it is then as mentioned by @T.Gaweda both reduceByKey and foldByKey are using combineByKeyWithClassTag inside so it does not matter as well in terms of working. I am looking for some concrete answer regarding initial Value parameter in transformation. – Sourav Gulati Mar 21 '17 at 04:22

0 Answers0