
I'm working in Python with a pandas DataFrame of video games, each with a genre. I'm trying to remove any video game whose genre appears fewer than some number of times in the DataFrame, but I have no clue how to go about this. I did find a StackOverflow question that seems to be related, but I can't decipher the solution at all (possibly because I've never heard of R and my memory of functional programming is rusty at best).
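For concreteness, here's a made-up frame of the kind I mean (titles and genres invented):

import pandas as pd

games = pd.DataFrame({
    'title': ['Alpha', 'Beta', 'Gamma', 'Delta'],
    'genre': ['RPG', 'RPG', 'Puzzle', 'Racing'],
})
# goal: drop every row whose genre appears fewer than N times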

Help?

uchuujin

5 Answers


Use groupby filter:

In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

In [12]: df
Out[12]:
   A  B
0  1  2
1  1  4
2  5  6

In [13]: df.groupby("A").filter(lambda x: len(x) > 1)
Out[13]:
   A  B
0  1  2
1  1  4

I recommend reading the split-apply-combine section of the docs.
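Applied to the question's video-game case (reusing the made-up games frame from the question; the column name and threshold are assumptions, not from the question):

min_count = 2  # keep genres appearing at least this many times
games.groupby('genre').filter(lambda g: len(g) >= min_count)
# keeps only the two 'RPG' rows; 'Puzzle' and 'Racing' appear once and are dropped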

Andy Hayden
  • And we can even group by MultiIndex level: groupby(level=...). Exactly what I've needed. – Winand Jun 16 '15 at 10:07
  • This didn't work for me; I got the following error and am not sure what to make of it: "filter function returned a Series, but expected a scalar bool" @Andy Hayden – ZakS Jul 08 '18 at 10:52

A better-performing solution is GroupBy.transform with 'size': it returns the per-group count as a Series the same length as the original df, so you can filter with boolean indexing:

df1 = df[df.groupby("A")['A'].transform('size') > 1]

Or use Series.map with Series.value_counts:

df1 = df[df['A'].map(df['A'].value_counts()) > 1]
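Both lines produce the same result. On the toy frame from the accepted answer, either one keeps the two A == 1 rows:

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
df[df.groupby('A')['A'].transform('size') > 1]
#    A  B
# 0  1  2
# 1  1  4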
jezrael
  • For what it's worth regarding performance, I ran the `Series.map` solution here against the `groupby.filter` solution above through `%%timeit` with the following results (on a dataframe of mostly JSON string data, grouping on a string ID column): `Series map: 2.34 ms ± 254 µs per loop`, `Groupby.filter: 269 ms ± 41.3 ms per loop`. So the `Series.map` solution is much faster. – jlhasson Apr 01 '20 at 14:50
  • What if the groupby is formed by multiple columns? – quest Oct 14 '20 at 14:16

@jezrael's solution works very well. Here is a different approach, filtering based on value counts:

For example, if the dataset is :

df = pd.DataFrame({'a': [1,2,3,3,1,6], 'b': [11,2,33,4,55,6]})

Convert and save the counts as a dictionary:

count_freq = dict(df['a'].value_counts())

Create a new column by copying the target column, then map the dictionary onto the new column:

df['count_freq'] = df['a']
df['count_freq'] = df['count_freq'].map(count_freq)

Now we have a new column with the count frequency; you can define a threshold and filter easily on this column:

df[df.count_freq>1]
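For the frame above, this keeps the rows where 'a' occurs more than once:

   a   b  count_freq
0  1  11           2
2  3  33           2
3  3   4           2
4  1  55           2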
Aaditya Ura

Additionally, in case one wants to filter and also keep a 'count' column:

attr = 'A'   # column to group on
limit = 10   # minimum group size to keep
df2 = df.groupby(attr)[attr].agg(count='count')
df2 = df2.loc[df2['count'] > limit].reset_index()
print(df2)

# outputs the 'A' values whose group count exceeds 10, with columns 'A' and 'count'
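For instance, on the toy frame from the accepted answer (with limit lowered to 1 so a group survives), this prints:

   A  count
0  1      2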

I might be a little late to this party but:

df = pd.DataFrame(df_you_have.groupby(['IdA', 'SomeOtherA'])['theA_you_want_to_count'].count())
df.reset_index(inplace=True)

This creates a new dataframe of counts, which you can then just filter:

df[df['theA_you_want_to_count'] > 100]
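Note that this filters the table of counts, not the original rows. A sketch of mapping back to the original frame (using the placeholder names above; the inner merge keeps only rows whose key pair survived the count filter):

keep = df.loc[df['theA_you_want_to_count'] > 100, ['IdA', 'SomeOtherA']]
filtered = df_you_have.merge(keep, on=['IdA', 'SomeOtherA'])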
Nik Cadez