I want to calculate the number of distinct values for all columns in a DataFrame.

Say, I have a DataFrame like this:

x y z
-----
0 0 0
0 1 1
0 1 2

I want another DataFrame (or any other structure) in this format:

col | num
---------
'x' |  1
'y' |  2
'z' |  3

What would be the most efficient way of doing that?

1 Answer

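You can use countDistinct (from org.apache.spark.sql.functions) to count distinct values. To apply it to every column, map over df.columns to build one expression per column, then pass the expressions to agg. Because agg takes a first Column followed by varargs, pass exprs.head and then exprs.tail: _*: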

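import org.apache.spark.sql.functions.countDistinct
// One countDistinct expression per column, aliased to the column's own name
val exprs = df.columns.map(c => countDistinct(c).as(c))
df.agg(exprs.head, exprs.tail: _*).show()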
+---+---+---+
|  x|  y|  z|
+---+---+---+
|  1|  2|  3|
+---+---+---+
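
The question asked for one row per column (col | num), while the aggregation above returns a single wide row. Here is a minimal sketch of one way to reshape it, assuming a local SparkSession and the sample data from the question. The object name DistinctCounts and the helper distinctCounts are hypothetical names used only for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.countDistinct

object DistinctCounts {
  // Hypothetical helper: returns (column name, distinct count) pairs.
  // Assumes df has at least one column; exprs.head fails otherwise.
  def distinctCounts(df: DataFrame): Seq[(String, Long)] = {
    val exprs = df.columns.map(c => countDistinct(c).as(c))
    // The aggregation yields exactly one row, so head() is cheap to collect.
    val row = df.agg(exprs.head, exprs.tail: _*).head()
    // countDistinct produces a Long, so read each position with getLong.
    df.columns.toSeq.zipWithIndex.map { case (c, i) => (c, row.getLong(i)) }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distinct-counts")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question.
    val df = Seq((0, 0, 0), (0, 1, 1), (0, 1, 2)).toDF("x", "y", "z")

    // Turn the pairs back into a two-column DataFrame.
    distinctCounts(df).toDF("col", "num").show()

    spark.stop()
  }
}
```

Collecting to the driver is reasonable here because the result has only one value per column. Note that countDistinct ignores nulls. If exact counts are not required on large data, approx_count_distinct from the same functions object may be a cheaper drop-in for countDistinct.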
Psidom