
I'm trying to separate a DataFrame into groups and drop groups below a minimum size (small outliers).

Here's what I've tried:

df.groupby(['A']).filter(lambda x: x.count() > min_size)
df.groupby(['A']).filter(lambda x: x.size() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].count() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].size() > min_size)

But these either throw an exception or return a table different from what I'm expecting. I'd just like to filter out the small groups, not compute a new table.

– Caleb Jares

3 Answers


You can use len:

In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

In [12]: df.groupby('A').filter(lambda x: len(x) > 1)
Out[12]:
   A  B
0  1  2
1  1  4
– Andy Hayden
    Thanks. I didn't realize that `filter` is actually returning the original DataFrame (not the grouped DataFrame). So the correct way to get the groups with size > N is `df.groupby('A').filter(lambda x: len(x) > N).groupby('A')`. – Caleb Jares Feb 08 '19 at 15:43
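
For illustration, a small sketch of that re-grouping idea on the example above (with a hypothetical N = 1, so there is something left to iterate over):

N = 1  # hypothetical minimum size for this illustration
big_groups = df.groupby('A').filter(lambda x: len(x) > N).groupby('A')

# iterate over the groups that survived the size filter
for name, group in big_groups:
    print(name)
    print(group)
# 1
#    A  B
# 0  1  2
# 1  1  4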

The number of rows is the first element of the .shape attribute, i.e. .shape[0]:

df.groupby('A').filter(lambda x: x.shape[0] >= min_size)

NB: If you want to remove the groups below the minimum size, keep those that are above or at the minimum size (>=, not >).
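
For instance, with the example DataFrame from the first answer and a hypothetical min_size of 2:

import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
min_size = 2  # hypothetical threshold: keep groups with at least two rows

print(df.groupby('A').filter(lambda x: x.shape[0] >= min_size))
#    A  B
# 0  1  2
# 1  1  4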

– DYZ

groupby.filter can be very slow for larger datasets / a large number of groups. A faster approach is to use groupby.transform:

Here's an example. First, create the dataset:

import pandas as pd
import numpy as np

# build ~1,000 groups, each with a random number (1 to 4) of rows
df = pd.concat([
    pd.DataFrame({'y': np.random.randn(np.random.randint(1, 5))}).assign(A=str(i))
    for i in range(1, 1000)
]).reset_index(drop=True)
print(df)
             y    A
0     1.375980    1
1    -0.023861    1
2    -0.474707    1
3    -0.151859    2
4    -1.696823    2
...        ...  ...
2424  0.276737  998
2425 -0.142171  999
2426 -0.718891  999
2427 -0.621315  999
2428  1.335450  999

[2429 rows x 2 columns]
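
The transform-based filtering alluded to above can be sketched like this (assuming a minimum group size of 3, which is just an illustrative threshold):

min_size = 3  # hypothetical minimum group size

# transform('size') broadcasts each group's row count back onto the original
# index, so a boolean mask keeps only rows belonging to large-enough groups
df_large = df[df.groupby('A')['A'].transform('size') >= min_size]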

Time it:

(timing output not preserved)
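
A rough sketch for running the comparison yourself (not the original benchmark; absolute timings depend on the machine and data):

import time

min_size = 3  # same hypothetical threshold as above

start = time.perf_counter()
via_filter = df.groupby('A').filter(lambda g: len(g) >= min_size)
print('groupby.filter:   ', time.perf_counter() - start)

start = time.perf_counter()
via_transform = df[df.groupby('A')['A'].transform('size') >= min_size]
print('groupby.transform:', time.perf_counter() - start)

# both approaches select the same rows
assert via_filter.equals(via_transform)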

– jwdink