3

I understand that when you call a groupby.transform with a DataFrame column, the column is passed to the function that transforms the data. But what I cannot understand is how to pass multiple columns to the function.

import numpy as np
import pandas as pd
people = pd.DataFrame(np.random.randn(5, 5), columns=['a', 'b', 'c', 'd', 'e'], index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']

Now I can easily demean the data and so on, but what I can't seem to do properly is transform data inside groups using values from multiple columns as parameters of the function. For example, if I wanted to add a column 'f' that took the value (a.mean() - b.mean()) * c for each observation, how can that be achieved using the transform method?

I have tried variants of the following

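people['f'] = np.nan
Grouped = people.groupby(key)
def TransFunc(col1, col2, col3):
    return (col1.mean() - col2.mean()) * col3
# Does not work: TransFunc is called immediately on groupby objects rather than being passed as a function
Grouped.f.transform(TransFunc(Grouped['a'], Grouped['b'], Grouped['c']))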

But this is clearly wrong. I have also tried to wrap the function in a lambda but can't quite make that work either.

I am able to achieve the result by iterating through the groups in the following manner:

for name, group in Grouped:
    Amean = group['a'].mean()
    Bmean = group['b'].mean()
    CList = list(group['c'])
    IList = list(group.index)

    for y in range(len(CList)):
        # .loc avoids chained assignment; range replaces Python 2's xrange
        people.loc[IList[y], 'f'] = (Amean - Bmean) * CList[y]

But that does not seem a satisfactory solution, particularly if the index is non-unique. Also, I know this must be possible using groupby.transform.

To generalise the question: how does one write functions for transforming data that have parameters that involve using values from multiple columns?

Help appreciated.

Woody Pride

2 Answers

6

You can use the apply() method:

import numpy as np
import pandas as pd
np.random.seed(0)

people2 = pd.DataFrame(np.random.randn(5, 5), 
                      columns=['a', 'b', 'c', 'd', 'e'], 
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']

Grouped = people2.groupby(key)

def f(df):
    df["f"] = (df.a.mean() - df.b.mean())*df.c
    return df

people2 = Grouped.apply(f)
print(people2)
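
As a rough illustration of what apply hands to the function, the sketch below records the row labels of each sub-DataFrame it receives and returns a new Series instead of modifying the group in place. The names `seen` and `f_series` are made up for this example, and `group_keys=False` is an assumption made so that the result keeps the original row labels across pandas versions.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
people2 = pd.DataFrame(np.random.randn(5, 5),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']

seen = []  # hypothetical log of the row labels of each group passed in

def f_series(df):
    # df is the sub-DataFrame holding one group's rows, with all columns
    seen.append(sorted(df.index))
    return (df['a'].mean() - df['b'].mean()) * df['c']

# group_keys=False (an assumption here) keeps the original row labels
result = people2.groupby(key, group_keys=False).apply(f_series)
people2['f'] = result  # assignment aligns on the row labels
print(seen)
print(people2)
```

Each entry in `seen` lists the rows of one group, which suggests the function is called once per group with that group's full set of columns.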

If you want a more general method:

Grouped = people2.groupby(key)

def f(a, b, c, **kw):
    return (a.mean() - b.mean())*c

people2["f"] = Grouped.apply(lambda df: f(**df))
print(people2)
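
For this particular formula, one possible alternative that needs no custom function is to broadcast each group mean back onto the rows with the built-in 'mean' transform and then combine the columns with ordinary arithmetic. This is only a sketch under the same setup as above; the names `a_mean` and `b_mean` are introduced for the example.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
people2 = pd.DataFrame(np.random.randn(5, 5),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']

Grouped = people2.groupby(key)

# transform('mean') returns a Series aligned with the original rows,
# where each row holds the mean of its own group
a_mean = Grouped['a'].transform('mean')
b_mean = Grouped['b'].transform('mean')

people2['f'] = (a_mean - b_mean) * people2['c']
print(people2)
```

Here transform is still applied one column at a time; the columns are only combined afterwards.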
HYRY
  • Thanks, that seems to work well. I have some follow-up questions if you don't mind. 1) What is being passed to the function f when it is called with apply? Is it each group of data sequentially? I assume it must be. 2) How can the function be called with multiple columns, so that people2 = Grouped.apply(f('a', 'b', 'c'))? Clearly the function would have to be changed, but in your example the function is not very abstract. I would want to write def f(df, col1, col2, col3) so that it could be used beyond the columns referenced inside the function. – Woody Pride Oct 28 '13 at 06:05
  • Yes, because in your example the function can only be used to return column f as modified based on inputs from columns a, b and c. How to generalize this? – Woody Pride Oct 28 '13 at 06:13
  • Yes, I will need to revise using kwargs, but that seems to be just about right. I would want to generalise such that I could use columns d, e, f etc., so I would enter those as arguments when calling the function. I supplied an answer entirely based on yours. Do you think it makes sense? Thanks for your help, this has been bothering me for some time. – Woody Pride Oct 28 '13 at 07:50
  • +1, the main part of the answer, I think, is to use apply instead of transform – Roman Pekar Oct 28 '13 at 08:57
  • Would it be right to say then that calling transform passes only the named column, or each column in the DF to the function individually and it is not possible to pass more than one column, whereas apply passes the whole data frame and then column values can be used within the function? I think that was where I was getting it wrong... – Woody Pride Oct 28 '13 at 10:25
0

This is based upon the answer provided by HYRY (thanks) who made me see how this could be achieved. My version does nothing more than generalise the function and enter the arguments of the function when it is called. I think though the function has to be called with a lambda:

import pandas as pd
import numpy as np
people = pd.DataFrame(np.random.randn(5, 5), columns=['a', 'b', 'c', 'd', 'e'], index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']
people['f'] = ""
Grouped = people.groupby(key)

def FUNC(df, col1, col2, col3, col4):
    df[col1] = (df[col2].mean() - df[col3].mean())*df[col4]
    return df

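# Relies on transform handing the whole group DataFrame to the lambda;
# if your pandas version passes one column at a time, use Grouped.apply instead
people2 = Grouped.transform(lambda x: FUNC(x, 'f', 'a', 'b', 'c'))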

This appears to me to be the best way I have seen of doing this. Basically, the entire grouped data frame is passed to the function as x, and then the column names can be supplied as arguments.
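
A possible variant of the same idea: groupby's apply forwards extra positional arguments to the function, so the column names can be supplied without a lambda. The sketch below uses a hypothetical helper, `scaled_mean_diff`, that returns a new Series rather than modifying the group, and `group_keys=False` is an assumption made to keep the original row labels. As far as I can tell, recent pandas versions have transform call the function one column at a time, so apply may be the safer choice for functions that need several columns.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
key = ['one', 'two', 'one', 'two', 'one']

def scaled_mean_diff(df, col_a, col_b, col_c):
    # Hypothetical helper: (group mean of col_a - group mean of col_b) * col_c
    return (df[col_a].mean() - df[col_b].mean()) * df[col_c]

Grouped = people.groupby(key, group_keys=False)

# Extra positional arguments after the function are passed on to it
f_values = Grouped.apply(scaled_mean_diff, 'a', 'b', 'c')
# The same helper reused with a different set of columns
g_values = Grouped.apply(scaled_mean_diff, 'd', 'e', 'a')

people['f'] = f_values  # assignment aligns on the row labels
people['g'] = g_values
print(people)
```

Because the results are aligned on row labels when assigned, this sketch assumes a unique index.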

Woody Pride