
I'm trying to separate a DataFrame into groups and drop groups below a minimum size (small outliers).

Here's what I've tried:

df.groupby(['A']).filter(lambda x: x.count() > min_size)
df.groupby(['A']).filter(lambda x: x.size() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].count() > min_size)
df.groupby(['A']).filter(lambda x: x['A'].size() > min_size)

But these either throw an exception or return a table different from what I'm expecting. I'd just like to filter out the small groups, not compute a new table.

– Caleb Jares

3 Answers


You can use len:

In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

In [12]: df.groupby('A').filter(lambda x: len(x) > 1)
Out[12]:
   A  B
0  1  2
1  1  4
– Andy Hayden
    Thanks. I didn't realize that `filter` is actually returning the original DataFrame (not the grouped DataFrame). So the correct way to get the groups with size > N is `df.groupby('A').filter(lambda x: len(x) > N).groupby('A')`. – Caleb Jares Feb 08 '19 at 15:43
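
For illustration, a small sketch of that re-grouping idea on the example above (with a hypothetical N = 1, so there is something left to iterate over):

N = 1  # hypothetical minimum size for this illustration
big_groups = df.groupby('A').filter(lambda x: len(x) > N).groupby('A')

# iterate over the groups that survived the size filter
for name, group in big_groups:
    print(name)
    print(group)
# 1
#    A  B
# 0  1  2
# 1  1  4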

The number of rows is the first element of the .shape attribute, i.e. .shape[0]:

df.groupby('A').filter(lambda x: x.shape[0] >= min_size)

NB: If you want to remove the groups below the minimum size, keep those that are above or at the minimum size (>=, not >).
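
For instance, with the example DataFrame from the first answer and a hypothetical min_size of 2:

import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
min_size = 2  # hypothetical threshold: keep groups with at least two rows

print(df.groupby('A').filter(lambda x: x.shape[0] >= min_size))
#    A  B
# 0  1  2
# 1  1  4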

– DYZ

groupby.filter can be very slow for larger datasets / a large number of groups. A faster approach is to use groupby.transform:

Here's an example. First, create the dataset:

import pandas as pd
import numpy as np

# build ~1,000 groups, each with a random number (1 to 4) of rows
df = pd.concat([
    pd.DataFrame({'y': np.random.randn(np.random.randint(1, 5))}).assign(A=str(i))
    for i in range(1, 1000)
]).reset_index(drop=True)
print(df)
             y    A
0     1.375980    1
1    -0.023861    1
2    -0.474707    1
3    -0.151859    2
4    -1.696823    2
...        ...  ...
2424  0.276737  998
2425 -0.142171  999
2426 -0.718891  999
2427 -0.621315  999
2428  1.335450  999

[2429 rows x 2 columns]
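
The transform-based filtering alluded to above can be sketched like this (assuming a minimum group size of 3, which is just an illustrative threshold):

min_size = 3  # hypothetical minimum group size

# transform('size') broadcasts each group's row count back onto the original
# index, so a boolean mask keeps only rows belonging to large-enough groups
df_large = df[df.groupby('A')['A'].transform('size') >= min_size]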

Time it:

(timing output not preserved)
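
A rough sketch for running the comparison yourself (not the original benchmark; absolute timings depend on the machine and data):

import time

min_size = 3  # same hypothetical threshold as above

start = time.perf_counter()
via_filter = df.groupby('A').filter(lambda g: len(g) >= min_size)
print('groupby.filter:   ', time.perf_counter() - start)

start = time.perf_counter()
via_transform = df[df.groupby('A')['A'].transform('size') >= min_size]
print('groupby.transform:', time.perf_counter() - start)

# both approaches select the same rows
assert via_filter.equals(via_transform)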

– jwdink