I am looking for a way to remove rows from a dataframe that contain low-frequency items. I adapted the following snippet from this post:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)),
                  columns=['A', 'B'])

threshold = 10  # Anything that occurs this many times or fewer will be removed.
value_counts = df.stack().value_counts()  # Counts over the entire DataFrame
to_remove = value_counts[value_counts <= threshold].index
df.replace(to_remove, np.nan, inplace=True)
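To turn those NaNs into actual row removal, a plain dropna can follow (this step is not part of the original snippet):

df.dropna(inplace=True)  # drop every row that contained an infrequent item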
The problem is that this code does not seem to scale. The line

to_remove = value_counts[value_counts <= threshold].index

has now been running for several hours on my data (a 2 GB compressed HDFStore). I therefore need a better solution, ideally out-of-core. I suspect dask.dataframe is suitable, but I fail to express the above code in terms of dask: the key functions stack and replace are absent from dask.dataframe.
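For context, I load the data out-of-core along these lines (the path and key below are placeholders for my actual store):

import dask.dataframe as dd

ddf = dd.read_hdf('store.h5', key='df')  # placeholder path and key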
I tried the following (which works in plain pandas) to work around the lack of these two functions:
# Per-column value counts, then the values occurring fewer than 3 times
value_countss = [df[col].value_counts() for col in df.columns]
infrequent_itemss = [value_counts[value_counts < 3]
                     for value_counts in value_countss]
# Collect the index of every row that holds an infrequent value in any column
rows_to_drop = set(
    i
    for col, infrequent_items in zip(df.columns, infrequent_itemss)
    for i in df.loc[df[col].isin(infrequent_items.keys())].index.values
)
df = df.drop(rows_to_drop)
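With dask, the same steps on ddf from above would read like this (a sketch of my attempt):

value_countss = [ddf[col].value_counts() for col in ddf.columns]
infrequent_itemss = [value_counts[value_counts < 3]
                     for value_counts in value_countss]
rows_to_drop = set(
    i
    for col, infrequent_items in zip(ddf.columns, infrequent_itemss)
    for i in ddf.loc[ddf[col].isin(infrequent_items.keys())].index.values
)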
That does not actually work with dask, though: it errors at infrequent_items.keys().
Even if it did work, it is the opposite of elegant, so I suspect there must be a better way.
Can you suggest something?