
I'm working in Python with a pandas DataFrame of video games, each with a genre. I'm trying to remove any video game whose genre appears fewer than some number of times in the DataFrame, but I have no clue how to go about this. I did find a StackOverflow question that seems to be related, but I can't decipher the solution at all (possibly because I've never heard of R and my memory of functional programming is rusty at best).
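For concreteness, here's a made-up frame of the kind I mean (titles and genres invented):

import pandas as pd

games = pd.DataFrame({
    'title': ['Alpha', 'Beta', 'Gamma', 'Delta'],
    'genre': ['RPG', 'RPG', 'Puzzle', 'Racing'],
})
# goal: drop every row whose genre appears fewer than N times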

Help?

uchuujin

5 Answers


Use groupby filter:

In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

In [12]: df
Out[12]:
   A  B
0  1  2
1  1  4
2  5  6

In [13]: df.groupby("A").filter(lambda x: len(x) > 1)
Out[13]:
   A  B
0  1  2
1  1  4

I recommend reading the split-apply-combine section of the docs.
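Applied to the question's video-game case (reusing the made-up games frame from the question; the column name and threshold are assumptions, not from the question):

min_count = 2  # keep genres appearing at least this many times
games.groupby('genre').filter(lambda g: len(g) >= min_count)
# keeps only the two 'RPG' rows; 'Puzzle' and 'Racing' appear once and are dropped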

Andy Hayden
  • And we can even group by MultiIndex level: groupby(level=...). Exactly what I've needed. – Winand Jun 16 '15 at 10:07
  • This didn't work for me; I got the following error and am not sure what to make of it: "filter function returned a Series, but expected a scalar bool" @Andy Hayden – ZakS Jul 08 '18 at 10:52

A better-performing solution is GroupBy.transform with 'size': it returns the per-group count as a Series the same length as the original df, so you can filter with boolean indexing:

df1 = df[df.groupby("A")['A'].transform('size') > 1]

Or use Series.map with Series.value_counts:

df1 = df[df['A'].map(df['A'].value_counts()) > 1]
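Both lines produce the same result. On the toy frame from the accepted answer, either one keeps the two A == 1 rows:

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
df[df.groupby('A')['A'].transform('size') > 1]
#    A  B
# 0  1  2
# 1  1  4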
jezrael
  • For what it's worth regarding performance, I ran the `Series.map` solution here against the `groupby.filter` solution above through `%%timeit` with the following results (on a dataframe of mostly JSON string data, grouping on a string ID column): `Series map: 2.34 ms ± 254 µs per loop`, `Groupby.filter: 269 ms ± 41.3 ms per loop`. So the `Series.map` solution is much faster. – jlhasson Apr 01 '20 at 14:50
  • What if the groupby is formed by multiple columns? – quest Oct 14 '20 at 14:16

@jezrael's solution works very well. Here is a different approach, filtering based on value counts:

For example, if the dataset is :

df = pd.DataFrame({'a': [1,2,3,3,1,6], 'b': [11,2,33,4,55,6]})

Convert and save the counts as a dictionary:

count_freq = dict(df['a'].value_counts())

Create a new column by copying the target column, then map the dictionary onto the new column:

df['count_freq'] = df['a']
df['count_freq'] = df['count_freq'].map(count_freq)

Now we have a new column with the count frequency; you can define a threshold and filter easily on this column:

df[df.count_freq>1]
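For the frame above, this keeps the rows where 'a' occurs more than once:

   a   b  count_freq
0  1  11           2
2  3  33           2
3  3   4           2
4  1  55           2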
Aaditya Ura

Additionally, in case one wants to filter and also keep a 'count' column:

attr = 'A'   # column to group on
limit = 10   # minimum group size to keep
df2 = df.groupby(attr)[attr].agg(count='count')
df2 = df2.loc[df2['count'] > limit].reset_index()
print(df2)

# outputs the 'A' values whose group count exceeds 10, with columns 'A' and 'count'
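For instance, on the toy frame from the accepted answer (with limit lowered to 1 so a group survives), this prints:

   A  count
0  1      2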

I might be a little late to this party but:

df = pd.DataFrame(df_you_have.groupby(['IdA', 'SomeOtherA'])['theA_you_want_to_count'].count())
df.reset_index(inplace=True)

This creates a new dataframe of counts, which you can then just filter:

df[df['theA_you_want_to_count'] > 100]
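Note that this filters the table of counts, not the original rows. A sketch of mapping back to the original frame (using the placeholder names above; the inner merge keeps only rows whose key pair survived the count filter):

keep = df.loc[df['theA_you_want_to_count'] > 100, ['IdA', 'SomeOtherA']]
filtered = df_you_have.merge(keep, on=['IdA', 'SomeOtherA'])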
Nik Cadez