I have a list of categorical columns from a dataframe and I want to find the cardinality of each column. Please guide me. I tried this:
data.select(cat_columns).distinct().count()
but somehow this does not work. Thanks
approx_count_distinct() should do the trick here, especially if you are iterating over a large dataframe.
Keep in mind that it returns an estimate: its rsd parameter is the maximum relative standard deviation allowed (default = 0.05). For rsd < 0.01, it is more efficient to use countDistinct().
from pyspark.sql.functions import approx_count_distinct
cols_cardinality = df_dataset.select(*[approx_count_distinct(c).alias(c) for c in df_dataset.columns])