
I have a list of categorical columns from a dataframe and I want to find the cardinality of each of those columns. Please guide me. I am trying this:

data.select(cat_columns).distinct().count()

but somehow this does not work. Thanks

Ajg
  • `data.select(cat_columns).distinct().count()` gives me the count of unique rows across all those columns together, not the cardinality of each column. – Ajg May 02 '17 at 16:49
  • I tried `cardinality = []` followed by `for col in cat_columns2: cardinality.append(data.select(col).distinct().count())`, but it is very slow. I don't think it is distributed. – Ajg May 02 '17 at 16:50
  • Possible duplicate of [Spark DataFrame: count distinct values of every column](https://stackoverflow.com/questions/40888946/spark-dataframe-count-distinct-values-of-every-column) – DivyaJyoti Rajdev Sep 16 '19 at 23:11

1 Answer


approx_count_distinct() should do the trick here if you are working with a large dataframe.

Take into consideration that it accepts an rsd parameter, the maximum estimation error allowed (default = 0.05). For rsd < 0.01, it is more efficient to use countDistinct().

from pyspark.sql.functions import approx_count_distinct

# Approximate distinct count for every column in a single pass over the data
cols_cardinality = df_dataset.select(*[approx_count_distinct(c).alias(c) for c in df_dataset.columns])
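
A minimal sketch of how you might read the result back, assuming df_dataset is an ordinary PySpark DataFrame (the names cardinality_by_col and exact_cardinality are illustrative, not from the original answer):

from pyspark.sql.functions import countDistinct

# cols_cardinality is a one-row DataFrame; pull that row into a plain dict
# mapping column name -> approximate cardinality
cardinality_by_col = cols_cardinality.first().asDict()

# Exact alternative when you need rsd < 0.01: countDistinct per column,
# still computed together in one distributed job
exact_cardinality = df_dataset.select(
    *[countDistinct(c).alias(c) for c in df_dataset.columns]
).first().asDict()

Either variant aggregates all columns in one job, which avoids the slow per-column loop from the comments above.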
Yash Karle