
I have a list of categorical columns from a dataframe and I want to find the cardinality of each of those columns. Please guide me. I am trying this:

data.select(cat_columns).distinct().count()

but somehow this does not work. Thanks

Ajg
  • `data.select(cat_columns).distinct().count()` gives me the count of unique rows across all those columns together, not the cardinality of each column. – Ajg May 02 '17 at 16:49
  • I tried `cardinality = []` followed by `for col in cat_columns2: cardinality.append(data.select(col).distinct().count())`, but it is very slow. I don't think it is distributed. – Ajg May 02 '17 at 16:50
  • Possible duplicate of [Spark DataFrame: count distinct values of every column](https://stackoverflow.com/questions/40888946/spark-dataframe-count-distinct-values-of-every-column) – DivyaJyoti Rajdev Sep 16 '19 at 23:11

1 Answer


approx_count_distinct() should do the trick here if you are working with a large dataframe.

Take into consideration that it accepts an rsd parameter, the maximum estimation error allowed (default = 0.05). For rsd < 0.01, it is more efficient to use countDistinct().

from pyspark.sql.functions import approx_count_distinct

# Approximate distinct count for every column in a single pass over the data
cols_cardinality = df_dataset.select(*[approx_count_distinct(c).alias(c) for c in df_dataset.columns])
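
A minimal sketch of how you might read the result back, assuming df_dataset is an ordinary PySpark DataFrame (the names cardinality_by_col and exact_cardinality are illustrative, not from the original answer):

from pyspark.sql.functions import countDistinct

# cols_cardinality is a one-row DataFrame; pull that row into a plain dict
# mapping column name -> approximate cardinality
cardinality_by_col = cols_cardinality.first().asDict()

# Exact alternative when you need rsd < 0.01: countDistinct per column,
# still computed together in one distributed job
exact_cardinality = df_dataset.select(
    *[countDistinct(c).alias(c) for c in df_dataset.columns]
).first().asDict()

Either variant aggregates all columns in one job, which avoids the slow per-column loop from the comments above.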
Yash Karle