
I'm trying to apply a custom function in pandas similar to the groupby and mutate functionality in dplyr.

Given a pandas DataFrame like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'category1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'category2': ['a', 'b', 'a', 'b', 'a', 'b'],
                   'var1': np.random.randint(0, 100, 6),
                   'var2': np.random.randint(0, 100, 6)})

df
  category1 category2  var1  var2
0         a         a    23    59
1         a         b    54    20
2         a         a    48    62
3         b         b    45    76
4         b         a    60    26
5         b         b    13    70

I want to apply some function that returns as many elements as there are rows in each group:

def myfunc(s):
  return [np.mean(s)] * len(s)

to get this result:

df
  category1 category2  var1  var2   var3
0         a         a    23    59   35.5
1         a         b    54    20   54
2         a         a    48    62   35.5
3         b         b    45    76   29
4         b         a    60    26   60
5         b         b    13    70   29

I was thinking of something along the lines of:

df['var3'] = df.groupby(['category1', 'category2'], group_keys=False).apply(lambda x: myfunc(x.var1))

but haven't been able to get the index to match.

In R with dplyr this would be

df <- df %>%
  group_by(category1, category2) %>%
  mutate(
    var3 = myfunc(var1)
  )

So I was able to solve it by using a custom function like:

def myfunc_data(data):
  data['var3'] = myfunc(data.var1)
  return data

and

df = df.groupby(['category1', 'category2']).apply(myfunc_data)

but I guess I was still wondering if there's a way to do it without defining this custom function.
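For reference, the `apply` attempt above can be made to line up if the function returns a pandas Series carrying the group's original index, so pandas can align it on assignment (a minimal sketch, inlining `myfunc` as a lambda):

```python
import pandas as pd

df = pd.DataFrame({'category1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'category2': ['a', 'b', 'a', 'b', 'a', 'b'],
                   'var1': [23, 54, 48, 45, 60, 13],
                   'var2': [59, 20, 62, 76, 26, 70]})

# Returning a Series that keeps each group's index lets pandas align the
# concatenated result back to df when assigning the new column.
df['var3'] = (df.groupby(['category1', 'category2'], group_keys=False)
                .var1.apply(lambda s: pd.Series([s.mean()] * len(s), index=s.index)))
```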

jtanman

4 Answers


Use GroupBy.transform, which returns a Series the same size as the original DataFrame, so it can be assigned directly to a new column:

import numpy as np
import pandas as pd

np.random.seed(123)

df = pd.DataFrame({'category1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'category2': ['a', 'b', 'a', 'b', 'a', 'b'],
                   'var1': np.random.randint(0, 100, 6),
                   'var2': np.random.randint(0, 100, 6)})

df['var3'] = df.groupby(['category1', 'category2'])['var1'].transform(myfunc)
print(df)
  category1 category2  var1  var2  var3
0         a         a    66    86    82
1         a         b    92    97    92
2         a         a    98    96    82
3         b         b    17    47    37
4         b         a    83    73    83
5         b         b    57    32    37

Alternative with lambda function:

df['var3'] = (df.groupby(['category1', 'category2'])['var1']
                .transform(lambda s: [np.mean(s)] * len(s)))
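Because `myfunc` just repeats one aggregated value per group, the string name of a built-in aggregation also works with `transform` (a sketch reusing the same seeded data):

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'category1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'category2': ['a', 'b', 'a', 'b', 'a', 'b'],
                   'var1': np.random.randint(0, 100, 6)})

# transform broadcasts a scalar result back to every row of its group,
# so there is no need to repeat the value len(s) times yourself.
df['var3'] = df.groupby(['category1', 'category2'])['var1'].transform('mean')
```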
jezrael
  • Ah got it, just in general are there good resources to understand like trying to do index matching, groupby, apply, etc.? – jtanman Apr 12 '19 at 05:08
  • @jtanman - For aggregation, you can check [this](https://stackoverflow.com/questions/53781634/aggregation-in-pandas). – jezrael Apr 12 '19 at 05:09
  • How would you do this yet pass in 'category1' as an argument to `myfunc`? How do you pass in arguments coming from the immediate adjacent cells so it uniformly places results in the newly generated column/cell? – rom May 19 '21 at 15:04
  • @rom - Use `s.name[0]` to get the value of `category1` inside `myfunc`. – jezrael May 20 '21 at 05:05

Try the following solution:

df.loc[:,'var3'] = df.groupby(['category1', 'category2']).var1.transform(myfunc)
bubble

You can technically achieve this using apply, which I'll add here for completeness, but I would recommend using the transform method – it's simpler and faster.

The problem you had was that you returned multiple values which, when using apply, gives you a list for each row. Instead, you can return just a single value and rely upon pandas to be smart about matching up those values (and duplicating where necessary) when you add in the new column. To make this work, though, we need to have the same index as the series returned from groupby/apply. Here's how you could do this (note the modification to myfunc too):

import numpy as np
import pandas as pd

def myfunc(s):
    return np.mean(s)

df = pd.DataFrame({'category1':['a','a','a', 'b', 'b','b'],
  'category2':['a', 'b', 'a', 'b', 'a', 'b'],
  'var1':np.random.randint(0,100,6),
  'var2':np.random.randint(0,100,6)}
)

df = (df.set_index(["category1", "category2"])
         .assign(var3=df.groupby(["category1", "category2"]).var1.apply(myfunc))
         .reset_index()
      )
df
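The alignment step is easier to see if you build the per-group means explicitly: the aggregated Series is indexed by the `(category1, category2)` MultiIndex, so assigning it to a frame with the same index broadcasts each mean to its group's rows (a sketch using the fixed data from the question):

```python
import pandas as pd

df = pd.DataFrame({'category1': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'category2': ['a', 'b', 'a', 'b', 'a', 'b'],
                   'var1': [23, 54, 48, 45, 60, 13]})

# means has one row per group, indexed by the (category1, category2) MultiIndex
means = df.groupby(['category1', 'category2']).var1.mean()

# setting the same MultiIndex on df lets assign broadcast each group's mean
out = (df.set_index(['category1', 'category2'])
         .assign(var3=means)
         .reset_index())
```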
Nathan
  • Ah thanks! Yeah I knew my issue was with matching indices because I was able to generate the entire list except it would be one cell instead of matching it to the groupby data as a column. – jtanman Apr 12 '19 at 05:31

This is easy to replicate in Python using the datar package:

>>> from datar.all import tibble, sample, mean, group_by, mutate, f
>>> from pipda import register_func
>>> 
>>> df = tibble(
...   category1=['a','a','a', 'b', 'b','b'],
...   category2=['a', 'b', 'a', 'b', 'a', 'b'],
...   # var1=sample(100, 6),
...   # var2=sample(100, 6)
...   var1=[23, 54, 48, 45, 60, 13],
...   var2=[59, 20, 62, 76, 26, 70]
... )
>>> df
  category1 category2    var1    var2
   <object>  <object> <int64> <int64>
0         a         a      23      59
1         a         b      54      20
2         a         a      48      62
3         b         b      45      76
4         b         a      60      26
5         b         b      13      70
>>>
>>> @register_func(None)
... def myfunc(s):
...     return mean(s)
>>>
>>> df >> group_by(
...     f.category1, f.category2
... ) >> mutate(
...     var3 = myfunc(f.var1)
... )
  category1 category2    var1    var2      var3
   <object>  <object> <int64> <int64> <float64>
0         a         a      23      59      35.5
1         a         b      54      20      54.0
2         a         a      48      62      35.5
3         b         b      45      76      29.0
4         b         a      60      26      60.0
5         b         b      13      70      29.0

[Groups: category1, category2 (n=4)]

Disclaimer: I am the author of the datar package.

Panwen Wang