I'd like to execute a GROUP BY clause on a properly partitioned DataFrame while grouping by the column that is also the partition key. Obviously, in this case there's no actual need for shuffling, as all equal keys already reside in the same partitions. However, I can't figure out how to actually avoid such a shuffle, or whether it's possible at all. I tried the bucketing and partitioning options on DataFrameWriter, but those don't seem to help much, as I keep seeing exchanges in the plan. Are there any ways to do a similar thing besides, say, mapPartitions?
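For reference, a minimal sketch of the kind of job I mean (the column names `key`/`value` and the toy data are placeholders, not my real schema). My understanding is that repartitioning by the grouping column up front should establish the distribution the aggregation needs, and `explain()` is how I check whether a second Exchange still appears:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object GroupByPartitionKey {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("groupby-partition-key")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy data; "key" stands in for the actual partition key column.
    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

    // Hash-partition by the grouping column before aggregating.
    val byKey = df.repartition($"key")

    // Inspect the physical plan: the question is whether the
    // aggregation inserts another Exchange on top of the repartition.
    byKey.groupBy($"key").agg(sum($"value")).explain()

    spark.stop()
  }
}
```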

Alexander Paschenko
I do not understand why this question has been marked as a duplicate. The questions are related, as both deal with partitioning and shuffling. But the other question deals with the handling of partitions when writing and re-reading a dataframe, while this one asks about how to optimize a groupBy operation. The accepted [answer](https://stackoverflow.com/a/48460070/2129801) does not address how one could solve the groupBy question. The [other answer](https://stackoverflow.com/a/48538119/2129801) gives a hint to use `bucketBy`, but would that answer also apply here? Am I missing something? – werner May 27 '18 at 18:50
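For completeness, the `bucketBy` approach the comment refers to would look roughly like the sketch below (the table name `bucketed_events` and the bucket count of 8 are placeholders; note that `bucketBy` only works together with `saveAsTable`, since the bucketing metadata lives in the table definition rather than in the files):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object BucketBySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucketby-sketch")
      .master("local[*]")
      .enableHiveSupport() // bucketBy needs a metastore-backed table
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

    // Write the data bucketed by the grouping column.
    df.write
      .bucketBy(8, "key")
      .sortBy("key")
      .mode("overwrite")
      .saveAsTable("bucketed_events")

    // Read the table back and group by the bucketing column;
    // check the plan for whether an Exchange is still inserted.
    spark.table("bucketed_events")
      .groupBy($"key")
      .agg(sum($"value"))
      .explain()

    spark.stop()
  }
}
```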