Lets say I have a dataset with around 2.1 billion records.
It's a dataset with customer information and I want to know how many times they did something. So I should group on the ID and sum one column (It has 0 and 1 values where the 1 indicates an action).
Now, I can use a simple groupBy
and agg(sum)
it, but to my understanding this is not really efficient. The groupBy
will move around a lot of data between partitions.
Alternatively, I can also use a Window function with a partitionBy
clause and then sum the data. One of the disadvantage is that I'll then have to apply an extra filter cause it keeps all the data. And I want one record per ID.
But I don't see how this Window handles the data. Is it better than this groupBy and sum. Or is it the same?