I have the following input
scala> val x = sc.parallelize(Array(("a", 1), ("b", 1), ("a", 1),
     |   ("a", 1), ("b", 1), ("b", 1),
     |   ("b", 1), ("b", 1)), 3)
x: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:24
When I use the groupByKey API:
scala> val y = x.groupByKey
y: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[7] at groupByKey at <console>:25
scala> y.collect
res20: Array[(String, Iterable[Int])] = Array((a,CompactBuffer(1, 1, 1)), (b,CompactBuffer(1, 1, 1, 1, 1)))
With groupByKey I do not have to specify any transformation of the values. However, since groupByKey is not efficient, I cannot use it.
From what I have read, reduceByKey and aggregateByKey require functions that transform or combine the input values.
Is it possible to achieve the groupByKey behavior using reduceByKey or aggregateByKey?
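For reference, here is a rough sketch of what I imagine this might look like (untested; the names y2 and y3 are mine). One idea wraps each value in a single-element collection and concatenates with reduceByKey; the other uses aggregateByKey with an empty mutable buffer as the zero value:

import scala.collection.mutable.ArrayBuffer

// Sketch 1: wrap each value in a List, then concatenate lists per key.
val y2 = x.mapValues(v => List(v)).reduceByKey(_ ++ _)

// Sketch 2: aggregateByKey with an empty buffer as the zero value.
// seqOp appends one value within a partition; combOp merges two buffers
// produced on different partitions.
val y3 = x.aggregateByKey(ArrayBuffer.empty[Int])(
  (buf, v) => buf += v,
  (b1, b2) => b1 ++= b2
)

I am not sure whether either of these actually avoids the shuffle cost of groupByKey, since every value still has to be moved to its key's partition.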