I'm trying to use the pandas split-apply-combine pattern with transform, with the twist that the applied function needs to operate on multiple columns. I can't get it to work with groupby(...).transform and have to go indirectly through groupby(...).apply. Is there a way to do this directly?

import pandas as pd
import numpy as np

df = pd.DataFrame({'Date':[1,1,1,2,2,2],'col1':[1,2,3,4,5,6],'col2':[1,2,3,4,5,6]})
col1 = 'col1'
col2 = 'col2'
def calc(dfg):
    nparray = np.array(dfg[col1])
    somecalc = np.array(dfg[col2])
    # do something with somecalc that helps calculate the result
    return(nparray - nparray.mean()) # just dummy data; the real function does a complicated calculation

#===> results in: KeyError: 'col1'
df['colnew'] = df.groupby('Date')[col1].transform(calc)

#===> results in: ValueError: could not broadcast input array from shape (9) into shape (9,16) or TypeError: cannot concatenate a non-NDFrame object
df['colnew'] = df.groupby('Date').transform(calc)

#===> this works but feels unnecessary 
def applycalc(df):
    df['colnew'] = calc(df)
    return(df)

df = df.groupby('Date').apply(applycalc)

This post is the closest I found. I would prefer to not pass in all the columns as separate parameters, besides the fact that there is a groupby operation.

EDIT: Note that I'm not really trying to calculate nparray - nparray.mean(); that's just a dummy calculation. The real function does something complicated and returns an array of shape (group_length,1). I also want to store colnew as a new column in the original dataframe.

citynorman

1 Answer

You can do the groupby first and then subtract, rather than doing both in one step:

In [11]: df["col1"] - df.groupby('Date')["col1"].transform("mean")
Out[11]:
0   -1
1    0
2    1
3   -1
4    0
5    1
dtype: int64
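When the calculation can be broken into per-group aggregates of several columns, the same idea extends to more than one column. The following is only a sketch: the formula for colnew is hypothetical (it is not the question's real calculation) and is there just to show two columns being combined after a transform.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': [1, 1, 1, 2, 2, 2],
                   'col1': [1, 2, 3, 4, 5, 6],
                   'col2': [1, 2, 3, 4, 5, 6]})

# Per-group means of both columns, broadcast back onto the original row index.
means = df.groupby('Date')[['col1', 'col2']].transform('mean')

# Hypothetical two-column formula (an assumption, not from the question):
# deviation of col1 from its group mean, scaled by the group mean of col2.
df['colnew'] = (df['col1'] - means['col1']) / means['col2']
```

Because transform returns a result indexed like df, the arithmetic and the assignment align row by row without any extra work.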

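In the general case, where the function needs more than one column, transform is a poor fit because it typically passes the function one column at a time; use apply with a function that returns a Series instead: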

In [21]: def calc2(dfg):
             return dfg["col1"] - dfg["col1"].mean()

In [22]: df.groupby('Date', as_index=True).apply(calc2)
Out[22]:
Date
1     0   -1
      1    0
      2    1
2     3   -1
      4    0
      5    1
Name: col1, dtype: float64

Note that it's important to return a Series; otherwise the result won't align with the original index:

In [23]: df.groupby('Date').apply(calc)
Out[23]:
Date
1    [-1.0, 0.0, 1.0]
2    [-1.0, 0.0, 1.0]
dtype: object
Andy Hayden
  • How can I assign the result back as a new column? `df['newcol'] = df.groupby('Date', as_index=True).apply(calc2)` gives an error since the RHS has a different index from the LHS. This wouldn't be an error if we could use `transform` instead of `apply` – jf328 Jul 04 '17 at 21:53
  • @jf328 IIUC I guess one option is to reset_index and merge... if not please ask a new question (perhaps someone else has a better idea!) – Andy Hayden Jul 05 '17 at 06:06
  • Yes, I guess reset_index and merge is a good solution. Currently I add a `.values` at the end of rhs so index is ignored, but I'm very worried about order mismatch although .groupby(sort = False) does promise to keep order. – jf328 Jul 05 '17 at 08:36