
Basically, I want to do the following, but without the for-loop:

from pyspark.sql import functions as F

uniqs = {}
for myCol in df.schema.names:
    # one full groupby/aggregation pass per column -- this is the overhead I want to avoid
    uniqs[myCol] = df.groupby("colX").agg(F.countDistinct(myCol)).collect()

I tried

uniqs = df.groupby("colX").agg(F.countDistinct(*df.schema.names)).collect()

but that does something else: it counts the distinct combinations across all the columns together, rather than producing a separate distinct count per column. The reason I want to avoid the for-loop is that it runs the groupby operation n times (once per column) instead of just once, incurring heavy overhead.

I'm on Spark 1.6.2.

Thomas
    Besides my accepted answer you can look into @mtoto's answer – eliasah Mar 23 '18 at 10:50
  • No idea how that did not turn up in my search, it is exactly what I was looking for. Clever trick to move the python loop into the aggregation command. Thanks! – Thomas Mar 23 '18 at 10:58
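For reference, the trick mentioned in the comments, moving the Python loop into the aggregation call itself, would look roughly like this. This is a minimal sketch assuming the same `df` and `colX` as in the question; the `distinct_`-prefixed aliases are purely illustrative:

    from pyspark.sql import functions as F

    # Build one countDistinct expression per column up front...
    exprs = [F.countDistinct(c).alias("distinct_" + c) for c in df.schema.names]

    # ...then compute all of them in a single groupby/agg pass,
    # instead of one full aggregation per column.
    uniqs = df.groupby("colX").agg(*exprs).collect()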
