
I have the following input

scala> val x = sc.parallelize(Array(("a", 1), ("b", 1), ("a", 1),
     |   ("a", 1), ("b", 1), ("b", 1),
     |   ("b", 1), ("b", 1)), 3)
x: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:24

When I use the groupByKey API,

scala> val y = x.groupByKey
y: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[7] at groupByKey at <console>:25

scala> y.collect
res20: Array[(String, Iterable[Int])] = Array((a,CompactBuffer(1, 1, 1)), (b,CompactBuffer(1, 1, 1, 1, 1)))

With groupByKey I do not have to pass any function; it collects the values as-is. But since groupByKey is inefficient, I would rather not use it.

Looking at the API, I saw that reduceByKey and aggregateByKey require a function that combines or transforms the input values.

Is it possible to achieve the groupByKey behavior using reduceByKey or aggregateByKey?
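
For concreteness, reduceByKey needs a function that merges two values with the same key, so the grouped output can be reproduced by first wrapping each value in a single-element list and then concatenating the lists. A minimal sketch against the RDD `x` above (assuming the order of values within a group does not matter; `viaReduce` is just an illustrative name):

val viaReduce: org.apache.spark.rdd.RDD[(String, List[Int])] =
  x.mapValues(List(_))     // wrap each value: (String, Int) -> (String, List[Int])
   .reduceByKey(_ ::: _)   // concatenate the lists for each key

Note this still shuffles every value across the cluster, just as groupByKey does.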

  • [It is possible](https://stackoverflow.com/questions/35388277/replace-groupbykey-with-reducebykey) but [you shouldn't do it](https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey). Also [this](https://stackoverflow.com/q/31029395/10465355) and [this](https://stackoverflow.com/a/39316189/10465355) (see the sketch after these comments). – 10465355 Dec 04 '18 at 00:24
  • If groupByKey is not efficient enough (or results in memory issues), then the other functions will also face that issue. They are also built on combineByKey. – sil Dec 04 '18 at 16:00
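
For completeness, a minimal sketch of the aggregateByKey variant hinted at in the first comment: start from an empty buffer, add values within each partition (seqOp), and concatenate buffers across partitions (combOp). As the linked posts note, this shuffles just as much data as groupByKey, so it avoids none of groupByKey's costs (`viaAggregate` is just an illustrative name, and value order within a group may differ from the CompactBuffer order):

val viaAggregate: org.apache.spark.rdd.RDD[(String, List[Int])] =
  x.aggregateByKey(List.empty[Int])(
    (buf, v) => v :: buf,   // seqOp: add a value to the per-partition buffer
    (b1, b2) => b1 ::: b2   // combOp: merge buffers from different partitions
  )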

0 Answers