Apply function on a particular column of a dataframe

Question

def include_mean():
    if pd.isnull('Age'):
        if 'Pclass'==1:
            return 38
        elif 'Pclass'==2:
            return 30
        elif 'Pclass'==3:
            return 25
        else: return 'Age'

train['Age']=train[['Age','Pclass']].apply(include_mean(),axis=1)

Why is the above code giving me a TypeError?

 TypeError: ("'NoneType' object is not callable", 'occurred at index 0')

I now know the right code which is

def impute_age(cols):
    Age = cols[0]
    Pclass = cols[1]

    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)

Now I want to know why these changes are required, i.e. the exact reasons behind them. What is 'cols' doing here?

Can you fix some of the indentation in your examples? It's a bit unclear where the else statements come in. Also the logic in the top and bottom cases seem totally different? (top has a Pclass=-3, while bottom does not) — ALollz, Sep 21 '19 at 14:10
Regardless, I think you may want to read this post: https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column?noredirect=1&lq=1. `numpy.select` is likely the best way to create your new column when you need to implement `elif` logic. — ALollz, Sep 21 '19 at 14:11

score 1 · Answer 1 · answered Sep 21 '19 at 13:30

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html

When you use the apply method on a pandas DataFrame, the function you pass to apply is called on every column (or on every row, depending on the axis parameter, which defaults to 0, the column axis). Your function must therefore accept one parameter: the column or row that apply passes to it. With axis=1, that parameter is the row.
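
To make this concrete, here is a minimal sketch of what apply hands to your function when axis=1. The two-row frame and the helper name inspect are made up for illustration; they are not from the question.

```python
import pandas as pd

# Hypothetical two-row frame, only for illustration.
df = pd.DataFrame({'Age': [22.0, 38.0], 'Pclass': [3, 1]})

seen = []

def inspect(row):
    # With axis=1, `row` is a pandas Series holding one row,
    # indexed by the column names of the frame.
    seen.append((type(row).__name__, list(row.index)))
    return row['Age']

# Pass the function itself (no parentheses); apply calls it once per row.
result = df.apply(inspect, axis=1)

print(seen)
print(result.tolist())
```

Each recorded entry shows that the argument is a Series whose index is the column names, which is why it can be indexed by label or by position inside the function.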

def include_mean():
    if pd.isnull('Age'):
        if 'Pclass'==1:
            return 38
        elif 'Pclass'==2:
            return 30
        elif 'Pclass'==3:
            return 25
        else: return 'Age'

There are a few issues with this.

'Pclass'==1 is guaranteed to be False, since you're comparing a string ('Pclass') with an integer (1), and the two can never be equal. What you want is to compare the value of the Pclass entry of the row, which you can retrieve by indexing the row: row["Pclass"], or row[1] if Pclass is the second column.
pd.isnull('Age') is always False, because the string 'Age' is not null. The function therefore never enters the if block and returns None. When you write d.apply(include_mean()), you call include_mean yourself, which returns None, and then pass that None to apply. But apply expects a callable (e.g. a function), hence the TypeError.
In the else clause, you're returning the string 'Age'. This means your dataframe would have had the value 'Age' in some cells.
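
The second point can be checked without any DataFrame. The sketch below uses a shortened copy of the question's function (only the first branch is kept) and shows that calling it yields None, which is not callable.

```python
import pandas as pd

def include_mean():
    # Same shape as the question's function: it tests string literals, not data.
    if pd.isnull('Age'):        # the string 'Age' is never null, so this is False
        if 'Pclass' == 1:       # a string never equals an integer
            return 38
    # No return statement is reached, so Python returns None implicitly.

result = include_mean()

print(result)                  # None
print(callable(include_mean))  # True: the function object is what apply needs
print(callable(result))        # False: None is what include_mean() passes instead
```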

Your second sample fixes those issues: the impute_age function now takes a parameter for the row (cols), the values of the Age and Pclass columns are looked up and compared, and you pass the function itself to apply without calling it.
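
As a sketch of the same fix written with label lookups instead of positions, the version below indexes the row by column name. The parameter name row and the four-row sample frame are assumptions for illustration; the fill values are the ones from your second sample.

```python
import numpy as np
import pandas as pd

def impute_age(row):
    # `row` is the Series that apply passes in for each row (axis=1).
    if pd.isnull(row['Age']):
        if row['Pclass'] == 1:
            return 37
        elif row['Pclass'] == 2:
            return 29
        else:
            return 24
    return row['Age']

# Hypothetical sample data standing in for the real `train` frame.
train = pd.DataFrame({'Age': [22.0, np.nan, np.nan, np.nan],
                      'Pclass': [3, 1, 2, 3]})

# Note: impute_age is passed without parentheses.
train['Age'] = train[['Age', 'Pclass']].apply(impute_age, axis=1)
print(train)
```

Looking values up by label keeps the function correct even if the column order in the selection changes.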

score 0 · Answer 2 · answered Sep 21 '19 at 12:38

Welcome to Python. To answer your question, especially at the beginning phase, sometimes you just need to crack open a fresh IPython notebook and try stuff out:

In [1]: import pandas as pd
   ...: def function(x):
   ...:     return x+1
   ...:
   ...: df = pd.DataFrame({'values':range(10)})
   ...: print(df)
   ...:
   values
0       0
1       1
2       2
3       3
4       4
5       5
6       6
7       7
8       8
9       9

In [2]: print(df.apply(function))
   values
0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
8       9
9      10

In your question, cols is the value for each row you're looping over.
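
One way to see where cols comes from is to write the loop by hand. The sketch below assumes a small made-up frame and compares apply with axis=1 against an explicit loop over iterrows; the two are roughly equivalent for this purpose, not identical in implementation. It uses .iloc for positional access in place of cols[0] and cols[1].

```python
import numpy as np
import pandas as pd

def impute_age(cols):
    # `cols` is one row of the frame, delivered as a Series.
    age = cols.iloc[0]      # first selected column: Age
    pclass = cols.iloc[1]   # second selected column: Pclass
    if pd.isnull(age):
        if pclass == 1:
            return 37
        elif pclass == 2:
            return 29
        else:
            return 24
    return age

# Hypothetical sample data, only for illustration.
train = pd.DataFrame({'Age': [22.0, np.nan, np.nan, np.nan],
                      'Pclass': [3, 1, 2, 3]})
subset = train[['Age', 'Pclass']]

# apply calls impute_age once per row and supplies the row itself.
via_apply = subset.apply(impute_age, axis=1).tolist()

# The same thing spelled out: you never pass `cols` yourself, the loop does.
via_loop = [impute_age(row) for _, row in subset.iterrows()]

print(via_apply)
print(via_loop)
```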

user1717828

still did not get it, while calling the function the code is not passing on the values for the 'cols' list. Can you please elaborate your answer? – Saurav Jaswal Sep 21 '19 at 12:49
@SauravJaswal Nope. I'll take down the answer. Good luck! – user1717828 Sep 21 '19 at 13:49

ALollz · Answer 3 · 2019-09-21T18:22:25.373

Do not use apply(axis=1). Instead, you should set the values on a subset using .loc. This is a simple mapping for the top case.

m = train.Age.isnull()
d = {1: 38, 2: 30, 3: 25}

train.loc[m, 'Age'] = train.loc[m, 'Pclass'].map(d)
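
A self-contained run of the mapping above, as a sketch on made-up data (the four-row train frame is an assumption; the dictionary values are the ones from the top case):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real `train` frame.
train = pd.DataFrame({'Age': [22.0, np.nan, np.nan, np.nan],
                      'Pclass': [3, 1, 2, 3]})

m = train['Age'].isnull()       # boolean mask: True where Age is missing
d = {1: 38, 2: 30, 3: 25}       # Pclass -> replacement age

# Only the rows selected by the mask are overwritten.
train.loc[m, 'Age'] = train.loc[m, 'Pclass'].map(d)
print(train)
```

Rows whose Age was already present are left untouched because they are never selected by the mask.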

For the bottom case, because of the else clause, we can use np.select. It works like this: we create a condition list that follows the order of the if/elif/else logic, then supply a choice list to select from at the first condition that is True. Since you have nested logic, we first need to unnest it so that it logically reads as

if age is null and pclass == 1
elif age is null and pclass == 2
elif age is null 
else

Sample Data

import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [50, 60, 70, np.nan, np.nan, np.nan, np.nan],
                   'Pclass': [1, 1, 1, 1, 2, np.nan, 1]})
#    Age  Pclass
#0  50.0     1.0
#1  60.0     1.0
#2  70.0     1.0
#3   NaN     1.0
#4   NaN     2.0
#5   NaN     NaN
#6   NaN     1.0

m = df.Age.isnull()
conds = [m & df.Pclass.eq(1),
         m & df.Pclass.eq(2),
         m]
choices = [37, 29, 24]

df['Age'] = np.select(conds, choices, default=df.Age)
                                      # |
                                      # Takes care of else, i.e. Age not null
print(df)
#    Age  Pclass
#0  50.0     1.0
#1  60.0     1.0
#2  70.0     1.0
#3  37.0     1.0
#4  29.0     2.0
#5  24.0     NaN
#6  37.0     1.0

edited Sep 21 '19 at 18:22

answered Sep 21 '19 at 12:53

ALollz


Completely good-willed question, no malice intended, I promise: did you realize you were being the caricature of a bad SO answerer when you posted this? OP is clearly curious about learning how `apply` gets an argument passed, and your answer is to not use `apply` in this case, but `loc`. Again, not trying to be rude, I just wanted to know if you were aware or if you are so tunnel-visioned you can't see how answering something other than how `apply`'s argument works can only frustrate a beginner. – user1717828 Sep 21 '19 at 13:53
@user1717828 The issue is that this is the XY problem. The user needs to map null values of age based on `Pclass`. They believe that this should be done with `apply`, iterating over the rows, so they ask a question about why apply doesn't work. The problem is that there's no sense in dealing with that problem because `apply` is 100% not the way to answer this problem. It's horribly inefficient and just should not be used. There's a reason this answer has 144 upvotes: https://stackoverflow.com/a/55557758/4333359 – ALollz Sep 21 '19 at 14:02
@user1717828 so yes, perhaps my answer was a bit blunt, but honestly I'd rather focus my time showing users the proper way to solve their problem with the correct tools, instead of trying to debug a tool they shouldn't be using in the first place. But your point is taken, I will add a section illustrating "how" one could make it work with `apply`, and why it's not useful. – ALollz Sep 21 '19 at 14:03
