This link and others tell me that the Spark groupByKey
is not to be used if there is a large number of keys, since Spark shuffles all the keys around. Does the same apply to the groupBy
function as well? Or is this something different?
I'm asking this because I want to do what this question tries to do, but I have a very large number of keys. It should be possible to do this without shuffling all the data around by reducing on each node locally, but I can't find the PySpark way to do this (frankly, I find the documentation quite lacking).
Essentially, what I am trying to do is:
# Non-working pseudocode
df.groupBy("A").reduce(lambda x, y: x if x.TotalValue > y.TotalValue else y)
However, the DataFrame API does not offer a "reduce" operation. I'm probably misunderstanding what exactly the DataFrame API is trying to achieve.
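The closest working thing I've been able to piece together is to drop down to the RDD API and use reduceByKey, which (as I understand it) combines locally on each partition before shuffling. Below is a minimal sketch with toy data in the same shape as my real DataFrame (columns A and TotalValue are just from my example above); I'm not sure whether this is considered idiomatic or whether it avoids the shuffle problem, which is really what I'm asking:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Toy data with the same shape as my real DataFrame
    df = spark.createDataFrame(
        [("a", 1.0), ("a", 3.0), ("b", 2.0)],
        ["A", "TotalValue"],
    )

    # Drop to the RDD API: key each row by column A, then reduce per key,
    # keeping the row with the larger TotalValue
    best_per_key = (
        df.rdd
          .map(lambda row: (row.A, row))
          .reduceByKey(lambda x, y: x if x.TotalValue > y.TotalValue else y)
          .values()
    )

    result = spark.createDataFrame(best_per_key)
    result.show()

Is something like this the right approach, or is there a proper DataFrame-level way to do it?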