
Basically, I want to do the following, but without the for-loop:

from pyspark.sql import functions as F

uniqs = {}
for myCol in df.schema.names:
    # one full groupby/aggregation pass per column -- this is the overhead I want to avoid
    uniqs[myCol] = df.groupby("colX").agg(F.countDistinct(myCol)).collect()

I tried

uniqs = df.groupby("colX").agg(F.countDistinct(*df.schema.names)).collect()

but that does something else: it counts the distinct combinations across all the columns together, rather than producing a separate distinct count per column. The reason I want to avoid the for-loop is that it runs the groupby operation n times (once per column) instead of just once, incurring heavy overhead.

I'm on Spark 1.6.2.

Thomas
    Besides my accepted answer you can look into @mtoto's answer – eliasah Mar 23 '18 at 10:50
  • No idea how that did not turn up in my search, it is exactly what I was looking for. Clever trick to move the python loop into the aggregation command. Thanks! – Thomas Mar 23 '18 at 10:58
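For reference, the trick mentioned in the comments, moving the Python loop into the aggregation call itself, would look roughly like this. This is a minimal sketch assuming the same `df` and `colX` as in the question; the `distinct_`-prefixed aliases are purely illustrative:

    from pyspark.sql import functions as F

    # Build one countDistinct expression per column up front...
    exprs = [F.countDistinct(c).alias("distinct_" + c) for c in df.schema.names]

    # ...then compute all of them in a single groupby/agg pass,
    # instead of one full aggregation per column.
    uniqs = df.groupby("colX").agg(*exprs).collect()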
