11

Is it possible to do a groupby transform with custom functions?

data = {
        'a':['a1','a2','a3','a4','a5'],
        'b':['b1','b1','b2','b2','b1'],
        'c':[55,44.2,33.3,-66.5,0],
        'd':[10,100,1000,10000,100000],
        }

import pandas as pd
df = pd.DataFrame.from_dict(data)

df['e'] = df.groupby(['b'])['c'].transform(sum) #this works as expected
print (df)
#    a   b     c       d     e
#0  a1  b1  55.0      10  99.2
#1  a2  b1  44.2     100  99.2
#2  a3  b2  33.3    1000 -33.2
#3  a4  b2 -66.5   10000 -33.2
#4  a5  b1   0.0  100000  99.2

def custom_calc(x, y):
    return (x * y)

#obviously wrong code here
df['e'] = df.groupby(['b'])['c'].transform(custom_calc(df['c'], df['d'])) 

As we can see from the above example, what I want is to explore the possibility of being able to pass in a custom function into .transform().

I am aware that .apply() exists, but I want to find out if it is possible to use .transform() exclusively.

More importantly, I want to understand how to formulate a proper function that can be passed into .transform() for it to apply correctly.

P.S. Currently, I know default functions like 'count', sum, 'sum', etc works.

ycx
  • 3,155
  • 3
  • 14
  • 26
  • 1
    related: https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object/47143056#47143056 – Dani Mesejo Jan 07 '19 at 18:10
  • 1
    I don't understand the operation. Do you want the total product of `c` and `d`? Or per-row? A desired output would be very helpful here. – user3483203 Jan 07 '19 at 18:17
  • 1
    `.transform` most easily acts on a single series. If you want to transform with a function that requires multiple series as the input it can be done, though it's rather annoying and can often be done in other ways that avoid the transform (i.e. a map) – ALollz Jan 07 '19 at 18:43
  • @ALollz You’re right. But my intention is on streamlining code and less so on optimisation for this question. Hence the reason why I wish to know if there is a template for creating functions that can be passed into .transform() – ycx Jan 07 '19 at 18:57
  • @user3483203 The code provided is a sample. It can be any kind of function calculation. The real gist of the question is in bold font. I want to know if a template for creating such functions to be passed into .transform() exists for the purpose of streamlining code – ycx Jan 07 '19 at 19:00
  • Well I think part of the confusion is that the function you provided isn't really one that's a good candidate for `.groupby.transform`. Since yours returns a `Series` or array there's really nothing to transform, as that's typically used to broadcast a scalar group value back to all other members of the group in the original `DataFrame` – ALollz Jan 07 '19 at 19:08

1 Answers1

16

One way I like to see what is happening is by creating a small custom function and printing out what is passed and its type. Then, you can see you have to work with.

def f(x):
    print(type(x))
    print('\n')
    print(x)
    print(x.index)
    return df.loc[x.index,'d']*x

df['f'] = df.groupby('b')['c'].transform(f)
print(df)

#Output from print statements in function
<class 'pandas.core.series.Series'>


0    55.0
1    44.2
4     0.0
Name: b1, dtype: float64
Int64Index([0, 1, 4], dtype='int64')
<class 'pandas.core.series.Series'>


2    33.3
3   -66.5
Name: b2, dtype: float64
Int64Index([2, 3], dtype='int64')
#End output from print statements in custom function

    a   b     c       d     e         f
0  a1  b1  55.0      10  99.2     550.0
1  a2  b1  44.2     100  99.2    4420.0
2  a3  b2  33.3    1000 -33.2   33300.0
3  a4  b2 -66.5   10000 -33.2 -665000.0
4  a5  b1   0.0  100000  99.2       0.0

Here, I am transforming on column 'c' but I make an "extranal" call to the dataframe object in my custom function to get 'd'.

You can also pass the "external" to be used as an argument like this:

def f(x, col):
    return df.loc[x.index, col]*x

df['g'] = df.groupby('b')['c'].transform(f, col='d')

print(df)

Output:

    a   b     c       d     e         f         g
0  a1  b1  55.0      10  99.2     550.0     550.0
1  a2  b1  44.2     100  99.2    4420.0    4420.0
2  a3  b2  33.3    1000 -33.2   33300.0   33300.0
3  a4  b2 -66.5   10000 -33.2 -665000.0 -665000.0
4  a5  b1   0.0  100000  99.2       0.0       0.0
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • In your `.transform(f, col='d')`, is it by default assumed that the first argument passed will always be the Series as outlined in `'c'` into `f(x)`? How does `sum` and `'count'` work exactly then? Because these default functions do not include `x` or is it by default assumed that `sum` is really `sum(x)` and `'count'` is really `'count(x)'`? Is that really the case as inferred by your code? How is `.transform()` able to evaluate `'count(x)'` when its in a string format in this case? Thank you for the detailed answer! – ycx Jan 08 '19 at 01:51
  • 1
    great answer Scott, especially in the 2nd part when you return 1 more column. Loved yu using index on df in 2nd part. I have been looking for something similar for 2 days and here it is :) – rishi jain Jun 14 '20 at 11:41
  • 1
    I dont understand 'df' in f function, why is it there? It is not defined anywhere. – Dinh Quang Tuan Dec 06 '20 at 12:46
  • 1
    'df' was defined in the question. And it is defined outside the function in the main program. That variable is avaible in the function even thought it wasn't passed. – Scott Boston Dec 06 '20 at 20:31
  • @ScottBoston: Thank you so much. Can you explain variable 'x' in the function and how it is used when 'transform'. I am confused with return df.loc[x.index, col]*x . – Dinh Quang Tuan Dec 07 '20 at 05:04
  • We can see in the top of this solution that when you do `df.groupby('b')['c'].transform(f)` you are passing a pd.Series to the custom function, f. So x, in tihs example is a pd.Series of column 'c' in two parts defined by 'b'. So, referring to the original dataframe, we can use loc to get the values from column defined by 'col' and multiply by the value in the pd.Series at each index. – Scott Boston Dec 07 '20 at 13:31