I want to do a splitting task that requires a minimum number of samples per class, so I want to filter a DataFrame by the column that holds the class labels. If a class occurs fewer times than some threshold, I want to drop every row belonging to that class.
>>> df = pd.DataFrame([[1,2,3], [4,5,6], [0,0,6]])
>>> df
   0  1  2
0  1  2  3
1  4  5  6
2  0  0  6
>>> filter_on_col(df, col=2, threshold=2) # Removes first row
   0  1  2
0  4  5  6
1  0  0  6
I can do something like `df[2].value_counts()` to get the frequency of each value in column 2, and then I can figure out which values meet my threshold simply by:
>>> df[2].value_counts() >= 2
6     True
3    False
and then the logic for figuring out the rest is pretty easy.
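A minimal sketch of that multi-step logic, assuming `threshold` means the minimum number of times a label must occur to be kept, and assuming the index should be renumbered as in the expected output above. `filter_on_col` is only the hypothetical helper name from the example, not a pandas function:

```python
import pandas as pd


def filter_on_col(df, col, threshold):
    """Keep rows whose value in `col` occurs at least `threshold` times.

    Hypothetical helper; `threshold` is assumed to be a minimum count.
    """
    # Count how often each label occurs in the chosen column.
    counts = df[col].value_counts()
    # Labels that are frequent enough to keep.
    keep = counts[counts >= threshold].index
    # Select the matching rows and renumber the index from 0.
    return df[df[col].isin(keep)].reset_index(drop=True)


df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [0, 0, 6]])
result = filter_on_col(df, col=2, threshold=2)
print(result)
```

This works, but it takes three separate steps (count, select labels, mask rows), which is what makes me suspect there is something shorter.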
But I feel like there's an elegant pandas one-liner for this, or maybe a more efficient method.
My question is pretty similar to "Select rows from a DataFrame based on values in a column in pandas", but the tricky part is that I'm filtering on how often a value occurs rather than on the values themselves.