40

edf.select("x").distinct.show() shows the distinct values present in the x column of the edf DataFrame.

Is there an efficient method to also show the number of times these distinct values occur in the data frame? (count for each distinct value)

Leothorn

6 Answers

78

countDistinct is probably the first choice:

import org.apache.spark.sql.functions.countDistinct

df.agg(countDistinct("some_column"))

If speed is more important than accuracy, you may consider approx_count_distinct (approxCountDistinct in Spark 1.x):

import org.apache.spark.sql.functions.approx_count_distinct

df.agg(approx_count_distinct("some_column"))
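The speed/accuracy trade-off can be tuned: approx_count_distinct has an overload that takes a maximum relative standard deviation (rsd) as a second argument. The following is a minimal sketch comparing the exact and approximate results; the local SparkSession, the generated data, and the rsd value of 0.01 are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{approx_count_distinct, countDistinct}

// Assumption: a local session, only for trying this out.
val spark = SparkSession.builder().master("local[*]").appName("distinct-count-sketch").getOrCreate()

// Illustrative data: 100,000 rows holding 1,000 distinct values.
val df = spark.range(0, 100000).selectExpr("id % 1000 AS some_column")

// Exact number of distinct values.
val exact = df.agg(countDistinct("some_column")).first().getLong(0)

// Approximate number; a smaller rsd asks for a tighter estimate at higher cost.
val approx = df.agg(approx_count_distinct("some_column", 0.01)).first().getLong(0)

println(s"exact = $exact, approx = $approx")
```

On small data the exact version is usually fast enough; the approximate one tends to pay off on large, high-cardinality columns.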

To get values and counts:

df.groupBy("some_column").count()

In SQL (spark-sql):

SELECT COUNT(DISTINCT some_column) FROM df

and

SELECT approx_count_distinct(some_column) FROM df
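For the per-value counts the question asks about, the SQL counterpart of groupBy(...).count() is a GROUP BY query. A minimal sketch, assuming a local SparkSession and a small made-up DataFrame registered as a temporary view named df (the alias cnt is also just illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session, only for trying this out.
val spark = SparkSession.builder().master("local[*]").appName("sql-value-counts-sketch").getOrCreate()
import spark.implicits._

// Illustrative data; the view name "df" matches the queries above.
val df = Seq("a", "b", "a", "c", "a", "b").toDF("some_column")
df.createOrReplaceTempView("df")

// One row per distinct value, with the number of times it occurs.
val perValue = spark.sql(
  "SELECT some_column, COUNT(*) AS cnt FROM df GROUP BY some_column")

perValue.show()
```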
Alper t. Turker
zero323
14

Roughly speaking, how it works:

[Two images in the original answer; not reproduced here.]

Saurav Sahu
13

Another option, without resorting to SQL functions:

df.groupBy("your_column_name").count().show()

show prints the distinct values and the number of times each occurs. Without show, the result is a DataFrame.
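Because the result is a DataFrame, it can be processed further, for example sorted so that the most frequent values come first. A minimal sketch, assuming a local SparkSession and a small made-up edf with a column x as in the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

// Assumption: a local session, only for trying this out.
val spark = SparkSession.builder().master("local[*]").appName("value-counts-sketch").getOrCreate()
import spark.implicits._

// Illustrative data standing in for the question's edf.
val edf = Seq("a", "b", "a", "c", "a", "b").toDF("x")

// groupBy(...).count() adds a column named "count"; sort on it, descending.
val counts = edf.groupBy("x").count().orderBy(desc("count"))

counts.show()
```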

Antoni
6
import org.apache.spark.sql.functions.countDistinct

df.groupBy("a").agg(countDistinct("s")).collect()
Community
user10232195
3

If you are using Java, import org.apache.spark.sql.functions.countDistinct; fails with the error: The import org.apache.spark.sql.functions.countDistinct cannot be resolved. This is because countDistinct is a static method of the functions class, not a class that can be imported this way.

To use countDistinct in Java, use the following form:

import org.apache.spark.sql.functions.*;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;

df.agg(functions.countDistinct("some_column"));
ForeverLearner
1
df.select("some_column").distinct.count
Petter Friberg
shengshan zhang