I don't know if it is the fastest solution, but you can use .melt() to unpivot your dataframe columns and then .groupby() the variable column to count the unique values in each group. This gives a significant performance improvement over your column-per-column solution:
dd.read_csv('test.csv').melt().groupby('variable')['value'].nunique().compute()
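To see why this works, here is a minimal sketch in plain pandas with a small made-up frame: .melt() stacks every column into two columns, variable (the original column name) and value (the cell contents), so grouping by variable and taking .nunique() yields one unique count per original column.
import pandas as pd

df = pd.DataFrame({'col_0': [1, 1, 2], 'col_1': [3, 4, 4]})
melted = df.melt()
# melted has one row per original cell:
#   variable  value
# 0    col_0      1
# 1    col_0      1
# 2    col_0      2
# 3    col_1      3
# 4    col_1      4
# 5    col_1      4
print(melted.groupby('variable')['value'].nunique())
# variable
# col_0    2
# col_1    2
# Name: value, dtype: int64
The Dask one-liner above does the same thing, just lazily and partition by partition.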
Let us generate some random integer data and save it as a CSV:
import numpy as np
import pandas as pd
from dask import dataframe as dd
nrows = 10000
ncols = 120
rng = np.random.default_rng(seed=1)
random_data = rng.integers(low=0, high=nrows // 2, size=(nrows, ncols))
pd.DataFrame(data=random_data).add_prefix('col_').to_csv('test.csv', index=False)
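If you want to make sure the file looks as expected before timing anything, a quick optional sanity check (reading the CSV back with plain pandas) could be:
check = pd.read_csv('test.csv')
print(check.shape)                  # expected: (10000, 120)
print(check.columns[:3].tolist())   # expected: ['col_0', 'col_1', 'col_2']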
We use the following two functions for performance evaluation:
def nunique_per_column():
    # Count unique values column by column, triggering a separate
    # Dask computation for every column.
    dask_df = dd.read_csv('test.csv')
    counts = []
    for col in dask_df.columns:
        counts.append(dask_df[col].nunique().compute())
    return pd.Series(counts, index=dask_df.columns)
def melt_groupby_nunique():
    # Melt all columns into (variable, value) pairs and count uniques
    # per group in a single Dask computation.
    return dd.read_csv('test.csv').melt().groupby('variable')['value'].nunique().compute()
First, check that both functions compute the same result:
pd.testing.assert_series_equal(nunique_per_column().sort_index(),
                               melt_groupby_nunique().sort_index(),
                               check_names=False)
Running %timeit on the functions and the sample data yields the following output on my machine:
%timeit nunique_per_column()
17.5 s ± 216 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit melt_groupby_nunique()
1.78 s ± 576 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)