250

I am having trouble with the pandas apply function when using multiple columns with the following DataFrame:

import numpy as np
from pandas import DataFrame

df = DataFrame({'a' : np.random.randn(6),
                'b' : ['foo', 'bar'] * 3,
                'c' : np.random.randn(6)})

and the following function

def my_test(a, b):
    return a % b

When I try to apply this function with:

df['Value'] = df.apply(lambda row: my_test(row[a], row[c]), axis=1)

I get the error message:

NameError: ("global name 'a' is not defined", u'occurred at index 0')

I do not understand this message; I thought I had defined the name properly.

I would highly appreciate any help on this issue

Update

Thanks for your help. I did indeed make some syntax mistakes in the code: the column names should be in quotes (''). However, I still get the same issue when using a more complex function such as:

def my_test(a):
    cum_diff = 0
    for ix in df.index():
        cum_diff = cum_diff + (a - df['a'][ix])
    return cum_diff 
smci
Andy
  • 1
    Avoid using `apply` as much as possible. If you're not sure you need to use it, you probably don't. I recommend taking a look at [When should I ever want to use pandas apply() in my code?](https://stackoverflow.com/q/54432583/4909087). – cs95 Jan 30 '19 at 10:22
  • This is just about syntax errors referencing a dataframe column, and why do functions need arguments. As to your second question, the function `my_test(a)` doesn't know what `df` is since it wasn't passed in as an argument (unless `df` is supposed to be a global, which would be terrible practice). You need to pass all the values you'll need inside a function as arguments (preferably in order), otherwise how else would the function know where `df` comes from? Also, it's bad practice to program in a namespace littered with global variables, you won't catch errors like this. – smci Mar 04 '19 at 02:43

6 Answers

397

It seems you forgot the quotes ('') around your column names.

In [43]: df['Value'] = df.apply(lambda row: my_test(row['a'], row['c']), axis=1)

In [44]: df
Out[44]:
                    a    b         c     Value
          0 -1.674308  foo  0.343801  0.044698
          1 -2.163236  bar -2.046438 -0.116798
          2 -0.199115  foo -0.458050 -0.199115
          3  0.918646  bar -0.007185 -0.001006
          4  1.336830  foo  0.534292  0.268245
          5  0.976844  bar -0.773630 -0.570417

BTW, in my opinion, the following way is more elegant:

In [53]: def my_test2(row):
....:     return row['a'] % row['c']
....:     

In [54]: df['Value'] = df.apply(my_test2, axis=1)
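
The same row-wise pattern also covers the more complex function from the question's update. Below is a minimal sketch, assuming the intent is to sum the differences between the current row's 'a' value and every value in column 'a'. The names cum_diff_row and col_a are hypothetical, the seed is added only for reproducibility, and the column is passed in explicitly instead of being read from a global df:

```python
import numpy as np
import pandas as pd

# Fixed seed for reproducibility (an assumption; the question uses unseeded data).
rng = np.random.default_rng(0)
df = pd.DataFrame({'a': rng.standard_normal(6),
                   'b': ['foo', 'bar'] * 3,
                   'c': rng.standard_normal(6)})

def cum_diff_row(row, col_a):
    """Sum of (row['a'] - x) for every x in col_a (hypothetical helper)."""
    cum_diff = 0.0
    for x in col_a:  # iterating a Series yields its values
        cum_diff += row['a'] - x
    return cum_diff

# Extra positional arguments are forwarded to the function through `args`.
df['Value'] = df.apply(cum_diff_row, axis=1, args=(df['a'],))

# The loop only sums (a - x) over the column, so it has a closed form.
expected = len(df) * df['a'] - df['a'].sum()
print(df)
```

Note that df.index is an attribute, not a method, so df.index() in the question would raise a TypeError. Since the loop reduces to len(df) * df['a'] - df['a'].sum(), the closed form gives the same values without apply.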
waitingkuo
  • Thanks, you are right, I forgot the ''. However, I still have the same issue with a more complex function. I would highly appreciate your help with that. Thanks – Andy May 03 '13 at 08:58
  • 5
    @Andy following [53-54] allows you to apply more complex functions. – Andy Hayden May 03 '13 at 09:29
  • @Andy you can define your complex function like the In[53] way. – waitingkuo May 03 '13 at 09:37
  • do all apply strategies perform the same? I'm new to pandas and have always found apply slightly enigmatic but your strategy in [53-54] is easy for me to understand (and hopefully remember) ... on a large table is it as quick as the other form of apply presented? – whytheq Sep 04 '16 at 09:48
  • Why is it that creating a separate method is considered more elegant - even for tiny methods. I have been doing significant projects in python for 7 years but will likely never be considered a `pythonista` due to some perspectives including this one. – WestCoastProjects Oct 20 '18 at 14:59
  • 3
    `axis=1` is important here – Luis Apr 10 '19 at 09:52
34

If you just want to compute (column a) % (column c), you don't need apply; just do it directly:

In [7]: df['a'] % df['c']
Out[7]: 
0   -1.132022
1   -0.939493
2    0.201931
3    0.511374
4   -0.694647
5   -0.023486
Name: a
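
To keep the result, the vectorized expression can be assigned straight to a new column. Here is a minimal sketch using small hand-picked values (an assumption, chosen so the results are easy to check) rather than the question's random data. It also shows that % on Series follows Python's convention that the result takes the sign of the divisor:

```python
import pandas as pd

# Hand-picked values (assumption) instead of np.random.randn, so results are predictable.
df = pd.DataFrame({'a': [7.0, -7.0, 7.0, -7.0],
                   'c': [3.0, 3.0, -3.0, -3.0]})

# Element-wise modulo on whole columns; no apply needed.
df['Value'] = df['a'] % df['c']

# Same computation row by row, shown for comparison only.
row_wise = df.apply(lambda row: row['a'] % row['c'], axis=1)

print(df)
```

Both forms should give the same numbers; the column-wise one simply avoids calling a Python function once per row.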
herrfz
  • 16
    I know, it is just an example to show my problem in applying a function to multiple columns – Andy May 03 '13 at 08:22
18

Let's say we want to apply a function add5 to columns 'a' and 'b' of DataFrame df:

def add5(x):
    return x+5

df[['a', 'b']].apply(add5)
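
Two details are worth spelling out: apply returns a new DataFrame, so the result has to be assigned back to take effect, and the selected columns must be numeric for x+5 to work. A minimal sketch, assuming a DataFrame shaped like the one in the question but with fixed values (an assumption, for predictability) and using the numeric columns 'a' and 'c', since 'b' holds strings there:

```python
import numpy as np
import pandas as pd

# Fixed values (assumption) so the effect of add5 is easy to see.
df = pd.DataFrame({'a': np.arange(6, dtype=float),
                   'b': ['foo', 'bar'] * 3,
                   'c': np.arange(6, dtype=float) * 2})

def add5(x):
    return x + 5

# With the default axis=0, add5 receives each selected column as a Series.
# apply does not modify df in place, so assign the result back.
df[['a', 'c']] = df[['a', 'c']].apply(add5)

print(df)
```

For a function this simple, df[['a', 'c']] + 5 gives the same values without apply.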
Mir_Murtaza
  • I am getting the following error while trying your code snippet: TypeError: ('must be str, not int', 'occurred at index b'). Can you please look into that? – Debashis Sahoo Aug 08 '18 at 05:55
  • Column b of your DataFrame is a string or object column; it must be numeric for a number to be added to it. – Mir_Murtaza Aug 08 '18 at 07:59
  • Wouldn't the changes only apply after assignment? – S.aad May 28 '20 at 08:03
11

All of the suggestions above work, but if you want your computations to be more efficient, you should take advantage of numpy vector operations (as pointed out here).

import pandas as pd
import numpy as np


df = pd.DataFrame({'a' : np.random.randn(6),
                   'b' : ['foo', 'bar'] * 3,
                   'c' : np.random.randn(6)})

Example 1: looping with pandas.apply():

%%timeit
def my_test2(row):
    return row['a'] % row['c']

df['Value'] = df.apply(my_test2, axis=1)

The slowest run took 7.49 times longer than the fastest. This could mean that an intermediate result is being cached. 1000 loops, best of 3: 481 µs per loop

Example 2: vectorize using pandas Series:

%%timeit
df['a'] % df['c']

The slowest run took 458.85 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 70.9 µs per loop

Example 3: vectorize using numpy arrays:

%%timeit
df['a'].values % df['c'].values

The slowest run took 7.98 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 6.39 µs per loop

So vectorizing with numpy arrays improved the speed by almost two orders of magnitude (about 75x: 481 µs with apply versus 6.39 µs).
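
The %%timeit cells above need IPython. Outside a notebook, the same comparison can be sketched with the standard-library timeit module. The row count n, the seed, and the function names with_apply, with_series and with_numpy are hypothetical choices for illustration, and the absolute timings will vary by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

n = 1000  # hypothetical row count; the examples above use 6
rng = np.random.default_rng(0)  # seed is an assumption, for reproducibility
df = pd.DataFrame({'a': rng.standard_normal(n),
                   'b': ['foo', 'bar'] * (n // 2),
                   'c': rng.standard_normal(n)})

def with_apply():
    # Example 1: Python-level loop over rows.
    return df.apply(lambda row: row['a'] % row['c'], axis=1)

def with_series():
    # Example 2: vectorized on pandas Series.
    return df['a'] % df['c']

def with_numpy():
    # Example 3: vectorized on the underlying numpy arrays.
    return df['a'].values % df['c'].values

timings = {}
for name, func in [('apply', with_apply),
                   ('series', with_series),
                   ('numpy', with_numpy)]:
    # Best of 3 repeats, 10 calls each, reported per call.
    timings[name] = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f"{name:>6}: {timings[name] * 1e6:.1f} µs per call")
```

Taking the minimum over several repeats reduces noise from caching and background load, which is what the "slowest run took ... longer" warnings above hint at.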

Blane
  • Results change even more dramatically for big numbers, e.g. replacing 6 with 10K, I get 248 ms, 332 µs, 263 µs respectively. So both vectorized solutions are much closer to each other, but the non-vectorized solution is 1000 times slower. (tested on python-3.7) – stason Feb 05 '20 at 02:49
3

This is the same as the previous solution, but I have defined the function inside df.apply itself:

df['Value'] = df.apply(lambda row: row['a']%row['c'], axis=1)
shaurya airi
2

Here is a timing comparison of all three approaches discussed above.

Using values

%timeit df['Value'] = df['a'].values % df['c'].values

139 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Without values

%timeit df['Value'] = df['a'] % df['c']

216 µs ± 1.86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Apply function

%timeit df['Value'] = df.apply(lambda row: row['a']%row['c'], axis=1)

474 µs ± 5.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Gursewak Singh