pyspark Window.partitionBy vs groupBy

Question

Lets say I have a dataset with around 2.1 billion records.

It's a dataset with customer information and I want to know how many times they did something. So I should group on the ID and sum one column (It has 0 and 1 values where the 1 indicates an action).

Now, I can use a simple groupBy and agg(sum) it, but to my understanding this is not really efficient. The groupBy will move around a lot of data between partitions.

Alternatively, I can also use a Window function with a partitionBy clause and then sum the data. One of the disadvantage is that I'll then have to apply an extra filter cause it keeps all the data. And I want one record per ID.

But I don't see how this Window handles the data. Is it better than this groupBy and sum. Or is it the same?

score 16 · Accepted Answer · answered Nov 08 '17 at 10:17

As far as I know, when working with spark DataFrames, the groupBy operation is optimized via Catalyst. The groupBy on DataFrames is unlike the groupBy on RDDs.

For instance, the groupBy on DataFrames performs the aggregation on partitions first, and then shuffles the aggregated results for the final aggregation stage. Hence, only the reduced, aggregated results get shuffled, not the entire data. This is similar to reduceByKey or aggregateByKey on RDDs. See this related SO-article with a nice example.

In addition, see slide 5 in this presentation by Yin Huai which covers the benefits of using DataFrames in conjunction with Catalyst.

Concluding, I think you're fine employing groupBy when using spark DataFrames. Using Window does not seem appropriate to me for your requirement.

if spark does the aggregation in the partitions first before shuffling, then why in skew cases many suggest to use salt in aggregation? example: df.groupBy('salt',the_keys).agg(target).groupBy(the_keys).agg(target) — zonna, Aug 12 '22 at 22:04

pyspark Window.partitionBy vs groupBy

1 Answers1