I want to do a splitting task that requires a minimum number of samples per class, so I want to filter a DataFrame by the column that holds the class labels. If a class occurs fewer times than some threshold, I want to drop its rows.

>>> df = pd.DataFrame([[1,2,3], [4,5,6], [0,0,6]])
>>> df
   0  1  2
0  1  2  3
1  4  5  6
2  0  0  6

>>> filter_on_col(df, col=2, threshold=2)  # Removes the first row (class 3 occurs only once)
   0  1  2
0  4  5  6
1  0  0  6

I can do something like `df[2].value_counts()` to get the frequency of each value in column 2, and then I can figure out which values meet my threshold simply by:

>>> df[2].value_counts() >= 2
6     True
3    False

and then the logic for figuring out the rest is pretty easy.
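
For reference, the multi-step version described above can be sketched roughly as follows. The minimum count of 2 is an assumed value for this toy DataFrame, and the names `counts`, `frequent` and `result` are only illustrative:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [0, 0, 6]])

# Step 1: count how often each value occurs in column 2.
counts = df[2].value_counts()

# Step 2: keep the values whose count meets the minimum (2 is an assumed threshold).
frequent = counts[counts >= 2].index

# Step 3: keep only the rows whose class label is one of those values.
result = df[df[2].isin(frequent)]
print(result)
```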

But I feel like there's an elegant, Pandas one-liner here that I can do, or maybe a more efficient method.

My question is pretty similar to: Select rows from a DataFrame based on values in a column in pandas, but the tricky part is that I'm relying on value frequency rather than the values themselves.

eyllanesc
Dave Liu
  • Am I missing something or does `df[df[2] >= 6]` not work? – cs95 Jun 06 '19 at 23:15
  • If I understand you correctly, you are looking for `df[df.groupby(2)[2].transform('size') > 6]` – Erfan Jun 06 '19 at 23:16
  • @cs95 No, that doesn't, because it only handles one number (6), but what if there are other values that occur more than twice? Sorry, my example had a bug. – Dave Liu Jun 06 '19 at 23:19
  • @Erfan Yes, that's exactly what I was looking for! If this question reopens, I'll gladly accept a formal answer post from you. – Dave Liu Jun 06 '19 at 23:29
  • @cs95 Also, you're getting the value of the column. "the tricky part is that I'm relying on value frequency rather than the values themselves" – Dave Liu Jun 06 '19 at 23:30
  • @Erfan I've reopened the question, go for it – cs95 Jun 06 '19 at 23:39

1 Answer

So this is a one-liner:

# Assuming the parameters of your specific example posed above.
col=2; thresh=2

df[df[col].isin(df[col].value_counts().ge(thresh).loc[lambda x: x].index)]

Out[303]: 
   0  1  2
1  4  5  6
2  0  0  6

Or another one-liner:

df[df.groupby(col)[col].transform('count') >= thresh]
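
If you want the `filter_on_col` helper from the question, a minimal sketch could look like the one below. It assumes the threshold is inclusive (a class is kept when it occurs at least `threshold` times) and that the label column has no missing values; it uses `transform('size')` as suggested in the comments.

```python
import pandas as pd


def filter_on_col(df, col, threshold):
    """Keep rows whose value in `col` occurs at least `threshold` times."""
    # For every row, look up how many rows share its value in `col`.
    counts = df.groupby(col)[col].transform("size")
    # Inclusive comparison is an assumption; use > for a strict minimum.
    return df[counts >= threshold]


df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [0, 0, 6]])
filtered = filter_on_col(df, col=2, threshold=2)
print(filtered)
```

This keeps the original index; chain `.reset_index(drop=True)` if you want it renumbered from 0 as in the question's expected output.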
Dave Liu
BENY