73

For example, I have the following table:

index,A,B
0,0,0
1,0,8
2,0,8
3,1,5
4,1,3

After grouping by A:

0:
index,A,B
0,0,0
1,0,8
2,0,8

1:
index,A,B
3,1,5
4,1,3

What I need is to drop rows from each group, where the number in column B is less than maximum value from all rows from group's column B. Well I have a problem translating and formulating this problem to English so here is the example:

Maximum value from rows in column B in group 0: 8

So I want to drop row with index 0 and keep rows with indexes 1 and 2

Maximum value from rows in column B in group 1: 5

So I want to drop row with index 4 and keep row with index 3

I have tried to use pandas filter function, but the problem is that it is operating on all rows in group at one time:

data = <example table>
grouped = data.groupby("A")
filtered = grouped.filter(lambda x: x["B"] == x["B"].max())

So what I ideally need is some filter, which iterates through all rows in group.

Thanks for help!

P.S. Is there also way to only delete rows in groups and do not return DataFrame object?

jirinovo
  • 2,105
  • 3
  • 26
  • 37
  • the last bit is totally ambiguous: if you're deleting rows in each group, but *not* returning a dataframe, then what do you want to return? – Paul H Dec 15 '14 at 16:59
  • Your data, when comparing the full table to the groups, don't match. please clear this up. – Paul H Dec 15 '14 at 17:02
  • Sorry I was a little bit busy when writting this question. Data are correct now. Well I mean just to delete rows from groups and keep these groups as they are - I need to apply several filters and after each apply is needed new groupby. – jirinovo Dec 15 '14 at 20:37
  • `df.query` and `pd.eval` seem like good fits for this use case. For information on the `pd.eval()` family of functions, their features and use cases, please visit [Dynamic Expression Evaluation in pandas using pd.eval()](https://stackoverflow.com/questions/53779986/dynamic-expression-evaluation-in-pandas-using-pd-eval). – cs95 Dec 16 '18 at 04:58
  • how do I apply the filter back to a groupby results. Do I have to convert the groupby results to a dataframe then apply the filter? I want to create a "having" affect. dfg=df2[(df2['AcctType']=='E')].groupby('CoaDescription')['TotalExpense'].sum() filter=dfg.apply(lambda g: g >10000) dfg=dfg[filter] ax=sns.barplot(y=dfg.index,x=dfg) plt.show() – Golden Lion Nov 19 '20 at 12:30

4 Answers4

87

You just need to use apply on the groupby object. I modified your example data to make this a little more clear:

import pandas
from io import StringIO

csv = StringIO("""index,A,B
0,1,0.0
1,1,3.0
2,1,6.0
3,2,0.0
4,2,5.0
5,2,7.0""")

df = pandas.read_csv(csv, index_col='index')
groups = df.groupby(by=['A'])
print(groups.apply(lambda g: g[g['B'] == g['B'].max()]))

Which prints:

         A  B
A index      
1 2      1  6
2 4      2  7
Paul H
  • 65,268
  • 20
  • 159
  • 136
  • Thanks, it works fine. Can I just ask you, what does `apply()` specifically do? And I am a little bit confused with `g[g['B']` – jirinovo Dec 15 '14 at 21:24
  • 6
    @jirinovo `groupby.apply(function)` runs every single group through that function and concatenates all of the results. `g[...]` is fancy/boolean indexing -- meaning that it only returns rows where that inner condition is true. In this case, the condition is `g['B'] == g['B'].max()`, e.g., all the rows where the value in column B is equal to the largest value of B within that group. – Paul H Dec 15 '14 at 21:56
  • 3
    Thanks. I personally find it misleading that the `filter` function doesn't filter rows based on some criterion. It feels like the obvious behavior. – gustafbstrom Aug 10 '20 at 13:41
33

EDIT: I just learned a much neater way to do this using the .transform group by method:

def get_max_rows(df):
    B_maxes = df.groupby('A').B.transform(max)
    return df[df.B == B_maxes] 

B_maxes is a series which identically indexed as the original df containing the maximum value of B for each A group. You can pass lots of functions to the transform method. I think once they have output either as a scalar or vector of the same length. You can even pass some strings as common function names like 'median'. This is slightly different to Paul H's method in that 'A' won't be an index in the result, but you can easily set that after.

import numpy as np
import pandas as pd
df_lots_groups = pd.DataFrame(np.random.rand(30000, 3), columns = list('BCD')
df_lots_groups['A'] = np.random.choice(range(10000), 30000)

%timeit get_max_rows(df_lots_groups)
100 loops, best of 3: 2.86 ms per loop

%timeit df_lots_groups.groupby('A').apply(lambda df: df[ df.B == df.B.max()])
1 loops, best of 3: 5.83 s per loop

EDIT:

Here's a abstraction which allows you to select rows from groups using any valid comparison operator and any valid groupby method:

def get_group_rows(df, group_col, condition_col, func=max, comparison='=='):
    g = df.groupby(group_col)[condition_col]
    condition_limit = g.transform(func)
    df.query('condition_col {} @condition_limit'.format(comparison))

So, for example, if you want all rows in above the median B-value in each A-group you call

get_group_rows(df, 'A', 'B', 'median', '>')

A few examples:

%timeit get_group_rows(df_lots_small_groups, 'A', 'B', 'max', '==')
100 loops, best of 3: 2.84 ms per loop
%timeit get_group_rows(df_lots_small_groups, 'A', 'B', 'mean', '!=')
100 loops, best of 3: 2.97 ms per loop
JoeCondron
  • 8,546
  • 3
  • 27
  • 28
  • 10
    I had to cut down on coffee because of how the pandas documentation made my blood pressure shoot up... May I ask where you learned about this thing? Also, allow me to [link to the `transform()` doc page](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.transform.html#pandas.core.groupby.GroupBy.transform) – mccc Feb 08 '16 at 13:41
  • 3
    I love Pandas but the docs, error messages and testing leave something to be desired. I don't remember where I first saw `transform` in use but I'm pretty sure it was here on SO. I often find novel ways of solving problems by looking at questions & answers here. If you use `ipython notebook` you can use tab completion to scan through the various methods, read the docstrings (not great, I know) and just experiment with them (in this case create a group by object and scan its methods) – JoeCondron Feb 08 '16 at 14:16
  • @mccc you need to look at the human-written docs and not the auto-generated references: http://pandas.pydata.org/pandas-docs/stable/groupby.html#transformation – Paul H May 17 '16 at 16:12
  • Would this also work without an aggregation function like max, mean, ...? So for example, if I only want to return the rows of groups where 'some_column == 1'? – Anonymous Jul 26 '19 at 10:54
  • This is really great; any idea how we can apply multiple filter conditions to this? – John Stud Oct 28 '19 at 14:10
  • Not sure exactly what you mean by multiple conditions. My naive answer is that you would generate a Boolean vector for each and chain them with `&`. Can you provide an example? – JoeCondron Nov 06 '19 at 16:27
12

Here's the other example for : Filtering the rows with maximum value after groupby operation using idxmax() and .loc()

In [465]: import pandas as pd

In [466]:   df = pd.DataFrame({
               'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2'],
               'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4'], 
               'value' : [3,2,5,8,10,1]     
                })

In [467]: df
Out[467]: 
   mt   sp  value
0  S1  MM1      3
1  S1  MM1      2
2  S3  MM1      5
3  S3  MM2      8
4  S4  MM2     10
5  S4  MM2      1

### Here, idxmax() finds the indices of the rows with max value within groups,
### and .loc() filters the rows using those indices :
In [468]: df.loc[df.groupby(["mt"])["value"].idxmax()]                                                                                                                           
Out[468]: 
   mt   sp  value
0  S1  MM1      3
3  S3  MM2      8
4  S4  MM2     10
Surya
  • 11,002
  • 4
  • 57
  • 39
4

All of these answers are good but I wanted the following:

(DataframeGroupby object) --> filter some rows out --> (DataframeGroupby object)

Shrug, it appears that is harder and more interesting than I expected. So this one liner accomplishes what I wanted but it's probably not the most efficient way :)

gdf.apply(lambda g: g[g['team'] == 'A']).reset_index(drop=True).groupby(gdf.grouper.names) 

Working Code Example:

import pandas as pd

def print_groups(gdf): 
    for name, g in gdf: 
        print('\n'+name) 
        print(g)

df = pd.DataFrame({'name': ['sue', 'jim', 'ted', 'moe'],
                   'team': ['A', 'A', 'B', 'B'], 
                   'fav_food': ['tacos', 'steak', 'tacos', 'steak']})                               

gdf = df.groupby('fav_food')                                                                                                                                           
print_groups(gdf)                                                                                                                                                      

    steak
        name team fav_food
    1  jim    A    steak
    3  moe    B    steak

    tacos
        name team fav_food
    0  sue    A    tacos
    2  ted    B    tacos

fgdf = gdf.apply(lambda g: g[g['team'] == 'A']).reset_index(drop=True).groupby(gdf.grouper.names)                                                                      
print_groups(fgdf)                                                                                                                                                     

    steak
      name team fav_food
    0  jim    A    steak

    tacos
      name team fav_food
    1  sue    A    tacos
Brian Wylie
  • 2,347
  • 28
  • 29
  • Is there any reason why everyone hates semantic names for variables when you write code in Python? What's the meaning of ```g```. What's the meaning of ```fgdf``` ? – bibscy Jun 26 '20 at 19:23
  • 1
    @bibscy `g` is for grouped (I think). Usually `grouped` is used though. `df` is widely used as `dataframe`. `fgdf` = `final grouped dataframe` ( I presume) – MasayoMusic Jul 02 '20 at 05:55