DSM's answer, which selects rows using a boolean mask, works well even if the DataFrame has a non-unique index.
My method, which selects rows using index values, is slightly slower when the index is unique and dramatically slower when the index contains duplicate values (so much so that I had to interrupt the benchmark below).
@roland: Please consider accepting DSM's answer instead.
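For concreteness, the examples below assume a small DataFrame shaped like the one in the question; this is my reconstruction, not necessarily the OP's exact data:

import pandas as pd

df = pd.DataFrame({'User': [1, 1, 2, 2, 2, 3, 3],
                   'X':    [0, 0, 0, 1, 0, 0, 0]})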
You could use a groupby-filter:
In [16]: df.loc[df.groupby('User')['X'].filter(lambda x: x.sum() == 0).index]
Out[16]:
User X
0 1 0
1 1 0
5 3 0
6 3 0
By itself, the groupby-filter just returns this:
In [29]: df.groupby('User')['X'].filter(lambda x: x.sum() == 0)
Out[29]:
0 0
1 0
5 0
6 0
Name: X, dtype: int64
but you can then use its index,
In [30]: df.groupby('User')['X'].filter(lambda x: x.sum() == 0).index
Out[30]: Int64Index([0, 1, 5, 6], dtype='int64')
to select the desired rows using df.loc.
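For comparison, DSM's boolean-mask version (the one timed below) gets the same rows in one step; a minimal sketch:

df.loc[df.groupby('User')['X'].transform(sum) == 0]

Since transform returns a Series aligned with df's index, the == 0 comparison produces a boolean mask of the same length, and boolean indexing doesn't need to look labels up one by one, so duplicate index labels don't hurt it.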
Here is the benchmark I used:
In [49]: df2 = pd.concat([df]*10000) # df2 has a non-unique index
I Ctrl-C'd this one because it was taking too long to finish:
In [50]: %timeit df2.loc[df2.groupby('User')['X'].filter(lambda x: x.sum() == 0).index]
Once I realized that the non-unique index was the culprit, I made a DataFrame with a unique index:
In [51]: df3 = df2.reset_index() # this gives df3 a unique index
In [52]: %timeit df3.loc[df3.groupby('User')['X'].filter(lambda x: x.sum() == 0).index]
100 loops, best of 3: 13 ms per loop
In [53]: %timeit df3.loc[df3.groupby("User")["X"].transform(sum) == 0]
100 loops, best of 3: 11.4 ms per loop
This shows DSM's method performs well even with a non-unique index:
In [54]: %timeit df2.loc[df2.groupby("User")["X"].transform(sum) == 0]
100 loops, best of 3: 11.2 ms per loop
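A rough intuition for the blowup, as far as I understand it: with duplicate labels, df.loc[list_of_labels] must gather every row matching each label and can't use the fast unique-index lookup path, so label-based selection degrades badly, while the boolean mask does not. A tiny illustration of the duplicate-label semantics:

import pandas as pd

s = pd.Series([10, 20, 30], index=[0, 0, 1])  # label 0 appears twice
s.loc[[0]]  # selects BOTH rows labeled 0, not just one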