
How is it possible to calculate the number of unique elements in each column of a PySpark DataFrame?

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame([[1, 100], [1, 200], [2, 300], [3, 100], [4, 100], [4, 300]], columns=['col1', 'col2'])
df_spark = spark.createDataFrame(df)
df_spark.show()
# +----+----+
# |col1|col2|
# +----+----+
# |   1| 100|
# |   1| 200|
# |   2| 300|
# |   3| 100|
# |   4| 100|
# |   4| 300|
# +----+----+

# Some transformations on df_spark here

# How can I get the number of unique elements (just a number) in each column?

I know only the following solution, which is very slow; both of these lines take roughly the same amount of time to compute:

col1_num_unique = df_spark.select('col1').distinct().count()
col2_num_unique = df_spark.select('col2').distinct().count()

There are about 10 million rows in df_spark.

    Possible duplicate of [Spark DataFrame: count distinct values of every column](https://stackoverflow.com/questions/40888946/spark-dataframe-count-distinct-values-of-every-column). Essentially you can do `df_spark.select(*[countDistinct(c).alias(c) for c in df_spark.columns])` but you should keep in mind that this is an expensive operation and consider if [`pyspark.sql.functions.approxCountDistinct()`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.approxCountDistinct) is suitable for you. – pault Dec 13 '18 at 15:01

3 Answers


Try this:

from pyspark.sql.functions import col, countDistinct

df_spark.agg(*(countDistinct(col(c)).alias(c) for c in df_spark.columns))
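
If you need the counts as plain Python numbers rather than a displayed table, one small sketch on top of the same expression (the variable names here are just for illustration) is to collect the single result Row and convert it to a dict:

from pyspark.sql.functions import col, countDistinct

# Aggregate once, take the single result Row, and turn it into a {column: count} dict
counts_row = df_spark.agg(*(countDistinct(col(c)).alias(c) for c in df_spark.columns)).first()
unique_counts = counts_row.asDict()
# For the example data above: {'col1': 4, 'col2': 3}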

EDIT: As @pault suggested, this is an expensive operation, and you can use approx_count_distinct() instead. The function he linked, approxCountDistinct(), is deprecated as of Spark 2.1.
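
For reference, a minimal sketch of the approximate variant (assuming Spark >= 2.1, where approx_count_distinct is available; the rsd value below is just an example tolerance and matches the default):

from pyspark.sql.functions import approx_count_distinct

# Approximate distinct count per column; rsd is the maximum allowed relative standard deviation
df_spark.agg(*(approx_count_distinct(c, rsd=0.05).alias(c) for c in df_spark.columns)).show()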

– Manrique

@Manrique's answer solved the problem, but only a slightly modified version worked for me:

from pyspark.sql.functions import countDistinct

expression = [countDistinct(c).alias(c) for c in df.columns]
df.select(*expression).show()
– Yauheni Leaniuk

This is much faster:

from pyspark.sql import functions as F
df_spark.select(F.countDistinct("col1")).show()
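
If you want the count as a plain Python integer rather than a displayed DataFrame, one small variation (just a sketch, reusing the F alias imported above) is:

# first() returns the single result Row; index 0 is the count itself
col1_num_unique = df_spark.select(F.countDistinct("col1")).first()[0]
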
– Saeed Esmaili