`countDistinct` is probably the first choice:
```scala
import org.apache.spark.sql.functions.countDistinct

df.agg(countDistinct("some_column"))
```
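If you need the result as a plain Scala value rather than a one-row DataFrame, you can extract it from the aggregation. A minimal, self-contained sketch (the session setup and sample data here are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

val spark = SparkSession.builder().appName("distinct-count").master("local[*]").getOrCreate()
import spark.implicits._

// Toy data standing in for your real "some_column".
val df = Seq("a", "b", "a", "c", "b").toDF("some_column")

// agg returns a single-row DataFrame; pull the count out as a Long.
val n = df.agg(countDistinct("some_column")).first().getLong(0)
println(n) // 3
```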
If speed is more important than accuracy, you may consider `approx_count_distinct` (`approxCountDistinct` in Spark 1.x):
```scala
import org.apache.spark.sql.functions.approx_count_distinct

df.agg(approx_count_distinct("some_column"))
```
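Under the hood `approx_count_distinct` uses HyperLogLog++, and it takes an optional maximum relative standard deviation, so the speed/accuracy trade-off can be tuned explicitly. For example:

```scala
import org.apache.spark.sql.functions.approx_count_distinct

// Allow at most 1% relative standard deviation (the default is 0.05).
// A tighter rsd costs more memory for the underlying HLL++ sketch.
df.agg(approx_count_distinct("some_column", 0.01))
```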
To get values and counts:
```scala
df.groupBy("some_column").count()
```
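The grouped result is an ordinary DataFrame, so it can, for example, be sorted to show the most frequent values first:

```scala
import org.apache.spark.sql.functions.desc

// One row per distinct value with its frequency, most common first.
df.groupBy("some_column").count().orderBy(desc("count")).show()
```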
In SQL (`spark-sql`):
```sql
SELECT COUNT(DISTINCT some_column) FROM df
```

and

```sql
SELECT approx_count_distinct(some_column) FROM df
```
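These queries assume `df` is visible to SQL as a table or view; if you are starting from a DataFrame, you can register it first (assuming a `SparkSession` named `spark`):

```scala
// Expose the DataFrame to SQL under the name "df".
df.createOrReplaceTempView("df")

spark.sql("SELECT COUNT(DISTINCT some_column) FROM df").show()
spark.sql("SELECT approx_count_distinct(some_column) FROM df").show()
```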