
I have a pandas dataframe with several columns, like

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,100,size=(100, 7)), columns=list('ABCDEFG'))

and I want to apply to it a function that takes all of the dataframe's columns as arguments:

# function would do something more complex potentially :)
def foo(a,b,c,d,e,f,g):
  # do stuff with a,b,c,d,e,f,g. Here I do something silly/simple
  return a + b*2 + c*3 + d*4 + e*5 + f*5 + g*5

Now, I would like to apply foo to all rows of df. What's the proper syntax to do so?

Both of my attempts work:

df.apply(lambda row: foo(row.iloc[0], row.iloc[1], row.iloc[2], row.iloc[3], row.iloc[4], row.iloc[5], row.iloc[6]), axis=1)  # terrible
df.apply(lambda row: foo(*row), axis=1)  # better

Is there a way to do it even more concisely, e.g. without a lambda?

Davide Fiocco
  • If the argument names in your function match your column names, you could create a dict and iterate over that, which would be less code. You could also zip both iterables and apply the function until the end of the iterable. – Umar.H Feb 04 '20 at 10:49
  • Why do you want to use apply here? https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code – ansev Feb 04 '20 at 10:53
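
The dict idea from the first comment can be sketched as follows. This is only an illustration: it assumes the parameter names of foo are the lower-cased column names, and the name kwargs is introduced here for the example.

```python
import numpy as np
import pandas as pd


def foo(a, b, c, d, e, f, g):
    return a + b*2 + c*3 + d*4 + e*5 + f*5 + g*5


df = pd.DataFrame(np.random.randint(0, 100, size=(100, 7)), columns=list("ABCDEFG"))

# Assumption: each parameter of foo is named after a column, lower-cased.
# df.items() yields (column name, column Series) pairs.
kwargs = {name.lower(): column for name, column in df.items()}

# foo receives whole columns, so the arithmetic runs once over all rows
# and the result is a Series aligned on df.index.
result = foo(**kwargs)
```

Because the columns are passed by name, this keeps working if the columns of df are reordered, which positional unpacking would not.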

2 Answers


Here's a way to pass all of the columns of the dataframe to the function without using apply or lambdas.

foo(*df.to_numpy().T)

That returns a numpy array. If you need it to return a pandas Series with the same index as the input, you can do this:

pd.Series(foo(*df.to_numpy().T), index=df.index)
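
As a quick sanity check, here is a small sketch comparing this call with the row-wise apply from the question. It relies on foo using only operations that work element-wise on arrays; the seeded generator and the names vectorized, row_wise and same are added here for illustration.

```python
import numpy as np
import pandas as pd


def foo(a, b, c, d, e, f, g):
    return a + b*2 + c*3 + d*4 + e*5 + f*5 + g*5


rng = np.random.default_rng(0)  # seeded only to make the check repeatable
df = pd.DataFrame(rng.integers(0, 100, size=(100, 7)), columns=list("ABCDEFG"))

# df.to_numpy() has shape (100, 7). Transposing gives shape (7, 100), so
# unpacking with * hands foo one 1-D array per column.
vectorized = pd.Series(foo(*df.to_numpy().T), index=df.index)

# The row-wise version from the question, for comparison.
row_wise = df.apply(lambda row: foo(*row), axis=1)

same = np.array_equal(vectorized.to_numpy(), row_wise.to_numpy())
```

Here same should be True, since both versions compute the same arithmetic on the same values.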

It turns out to be much faster than the lambda method (at least for me, running Python 3.5):

>>> import timeit
>>> timeit.timeit("df.apply(lambda row: foo(*row), axis = 1)", setup="from __main__ import foo, df", number=10)    
0.028233799999981102
>>> timeit.timeit("pd.Series(foo(*df.to_numpy().T), index=df.index)", setup="from __main__ import foo, df, pd", number=10)
0.0019406999999773689
>>> timeit.timeit("foo(*df.to_numpy().T)", setup="from __main__ import foo, df", number=10)                        
0.0004090000000189775

In this run, that's about 69x faster when returning a NumPy array and about 15x faster when returning a pandas Series and keeping the index!
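
The trick above depends on foo accepting whole arrays. If the real function only works on scalars, one possible lambda-free fallback is itertools.starmap over the rows. The sketch below uses a made-up function, foo_scalar, whose if statement would fail on arrays; it is still a Python-level loop, so it should not be expected to match the vectorized timings.

```python
from itertools import starmap

import numpy as np
import pandas as pd


def foo_scalar(a, b, c, d, e, f, g):
    # Hypothetical stand-in for a function that needs scalar inputs:
    # "if a > b" raises ValueError when a and b are arrays.
    if a > b:
        return a + c
    return b + d + e + f + g


df = pd.DataFrame(np.random.randint(0, 100, size=(100, 7)), columns=list("ABCDEFG"))

# itertuples(index=False, name=None) yields each row as a plain tuple,
# and starmap unpacks every tuple into foo_scalar's seven arguments.
rows = df.itertuples(index=False, name=None)
result = pd.Series(list(starmap(foo_scalar, rows)), index=df.index)
```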

Andrew

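A simple tweak to the function lets you drop the lambda. Note that df.apply(foo) passes each column in turn as a, with the remaining arguments taken from the defaults, so the result is a DataFrame with one column per input column: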

def foo(a=df['A'],b=df['B'],c=df['C'],d=df['D'],e=df['E'],f=df['F'],g=df['G']):
    return a + b*2 + c*3 + d*4 + e*5 + f*5 + g*5

df.apply(foo)

       A     B     C     D     E     F     G
0    755   731   736   745   717   696   697
1   1365  1330  1321  1323  1332  1348  1367
2    985  1002   971   982  1012  1017  1052
3   1078  1016  1094  1034  1034  1049  1102
4   1045  1059  1041  1101  1100  1025  1041
..   ...   ...   ...   ...   ...   ...   ...
95  1318  1338  1341  1349  1357  1356  1358
96  1323  1387  1349  1321  1315  1370  1389
97  1066  1101  1057  1098  1132  1078  1067
98  1261  1229  1273  1312  1283  1296  1231
99  1585  1522  1537  1590  1591  1558  1548

[100 rows x 7 columns]

Update

df.apply(lambda x: x['A'] + x['B']*2 + x['C']*3 + x['D']*4 + x['E']*5 + x['F']*5 + x['G']*5, axis=1)

0      755
1     1365
2      985
3     1078
4     1045
      ... 
95    1318
96    1323
97    1066
98    1261
99    1585
Length: 100, dtype: int64
iamklaus
  • Thanks, but the expected result in my example is a Series of length 100! (Mind the axis=1 passed to apply: rows are fed to the function one by one.) – Davide Fiocco Feb 04 '20 at 10:50
  • I have updated the answer with a simpler version. In your case a loop is required, since you need the individual values and want to perform an operation on them. I can't imagine any other way, though I could be wrong too. – iamklaus Feb 04 '20 at 10:58
  • `df.mul(pd.Series(data = [1,2,3,4,5,5,5],index = df.columns)).sum(axis = 1)` – ansev Feb 04 '20 at 11:03
  • @ansev The idea is to have a function with many arguments, not simply to multiply column values. I will make the question more generic... – Davide Fiocco Feb 04 '20 at 13:36
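
For the specific weighted sum in the question, the df.mul one-liner from the comment above can be checked against foo with a short sketch. The names weights, weighted and matches are introduced here for illustration, and this shortcut applies only because foo happens to be a linear combination of the columns.

```python
import numpy as np
import pandas as pd


def foo(a, b, c, d, e, f, g):
    return a + b*2 + c*3 + d*4 + e*5 + f*5 + g*5


df = pd.DataFrame(np.random.randint(0, 100, size=(100, 7)), columns=list("ABCDEFG"))

# One weight per column, aligned on the column labels.
weights = pd.Series([1, 2, 3, 4, 5, 5, 5], index=df.columns)

# Multiply each column by its weight, then sum across each row.
weighted = df.mul(weights).sum(axis=1)

# Compare with calling foo on the columns directly.
matches = np.array_equal(weighted.to_numpy(), foo(*df.to_numpy().T))
```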