
How is it possible to calculate the number of unique elements in each column of a PySpark DataFrame?

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame([[1, 100], [1, 200], [2, 300], [3, 100], [4, 100], [4, 300]], columns=['col1', 'col2'])
df_spark = spark.createDataFrame(df)
df_spark.show()
# +----+----+
# |col1|col2|
# +----+----+
# |   1| 100|
# |   1| 200|
# |   2| 300|
# |   3| 100|
# |   4| 100|
# |   4| 300|
# +----+----+

# Some transformations on df_spark here

# How can I get the number of unique elements (just a number) in each column?

I know only the following solution, which is very slow; both of these lines take roughly the same amount of time to compute:

col1_num_unique = df_spark.select('col1').distinct().count()
col2_num_unique = df_spark.select('col2').distinct().count()

There are about 10 million rows in df_spark.

    Possible duplicate of [Spark DataFrame: count distinct values of every column](https://stackoverflow.com/questions/40888946/spark-dataframe-count-distinct-values-of-every-column). Essentially you can do `df_spark.select(*[countDistinct(c).alias(c) for c in df_spark.columns])` but you should keep in mind that this is an expensive operation and consider if [`pyspark.sql.functions.approxCountDistinct()`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.approxCountDistinct) is suitable for you. – pault Dec 13 '18 at 15:01

3 Answers


Try this:

from pyspark.sql.functions import col, countDistinct

df_spark.agg(*(countDistinct(col(c)).alias(c) for c in df_spark.columns))
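
If you need the counts as plain Python numbers rather than a displayed table, one small sketch on top of the same expression (the variable names here are just for illustration) is to collect the single result Row and convert it to a dict:

from pyspark.sql.functions import col, countDistinct

# Aggregate once, take the single result Row, and turn it into a {column: count} dict
counts_row = df_spark.agg(*(countDistinct(col(c)).alias(c) for c in df_spark.columns)).first()
unique_counts = counts_row.asDict()
# For the example data above: {'col1': 4, 'col2': 3}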

EDIT: As @pault suggested, this is an expensive operation, and you can use approx_count_distinct() instead. The function he linked, approxCountDistinct(), is deprecated as of Spark 2.1.
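
For reference, a minimal sketch of the approximate variant (assuming Spark >= 2.1, where approx_count_distinct is available; the rsd value below is just an example tolerance and matches the default):

from pyspark.sql.functions import approx_count_distinct

# Approximate distinct count per column; rsd is the maximum allowed relative standard deviation
df_spark.agg(*(approx_count_distinct(c, rsd=0.05).alias(c) for c in df_spark.columns)).show()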

– Manrique

@Manrique's answer solved the problem, but only a slightly modified version worked for me:

from pyspark.sql.functions import countDistinct

expression = [countDistinct(c).alias(c) for c in df.columns]
df.select(*expression).show()
– Yauheni Leaniuk

This is much faster:

from pyspark.sql import functions as F
df_spark.select(F.countDistinct("col1")).show()
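
If you want the count as a plain Python integer rather than a displayed DataFrame, one small variation (just a sketch, reusing the F alias imported above) is:

# first() returns the single result Row; index 0 is the count itself
col1_num_unique = df_spark.select(F.countDistinct("col1")).first()[0]
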
– Saeed Esmaili