Apply a function to modify multiple columns with if else logic

Question

I am trying to write a function with if-else logic that modifies two columns in my data frame, but it's not working. Here is my function:

def get_comment_status(df):
    if df['address'] == 'NY':
        df['comment'] = 'call tomorrow'
        df['selection_status'] = 'interview scheduled'
        return df['comment'] 
        return df['selection_status']
    else:
        df['comment'] = 'Dont call'
        df['selection_status'] = 'application rejected'
        return df['comment']
        return df['selection_status']

I then call the function like this:

df[['comment', 'selection_status']] = df.apply(get_comment_status, axis = 1)

But I am getting an error. What am I doing wrong? My guess is that the df.apply() syntax is wrong.

Error Message:

TypeError: 'str' object cannot be interpreted as an integer
KeyError: ('address', 'occurred at index 0')

sample dataframe:

df = pd.DataFrame({'address': ['NY', 'CA', 'NJ', 'NY', 'WS', 'OR', 'OR'],
               'name1': ['john', 'mayer', 'dylan', 'bob', 'mary', 'jake', 'rob'],
               'name2': ['mayer', 'dylan', 'mayer', 'bob', 'bob', 'tim', 'ben'],
               'comment': ['n/a', 'n/a', 'n/a', 'n/a', 'n/a', 'n/a', 'n/a'],
               'score': [90, 8, 88, 72, 34, 95, 50],
               'selection_status': ['inprogress', 'inprogress', 'inprogress', 'inprogress', 'inprogress', 'inprogress', 'inprogress']})

I have also thought of using a lambda function, but it doesn't work because I was trying to assign values to the 'comment' and 'selection_status' columns using '='.

Note: I have checked this question, which has a similar title but doesn't solve my problem.

Look at your return statements: only the first one in each branch gets executed. You'll need to return something else, essentially both values at the same time. — 9769953, Jun 12 '18 at 22:38
Note that `.apply` doesn't work on a dataframe, but on a row. For your code, it doesn't matter, but the naming of your variable `df` in your function implies you're thinking incorrectly about apply, which will cause confusion later on. — 9769953, Jun 12 '18 at 22:39
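
Both comments can be checked directly. The sketch below uses a small two-row frame and two helper functions, `describe_row` and `two_returns`; all three are made up for illustration and are not part of the question's code. It shows that `apply` with `axis=1` passes each row as a Series, and that a second `return` in the same branch is never reached.

```python
import pandas as pd

# Small illustrative frame (an assumption, not the asker's full data).
df = pd.DataFrame({'address': ['NY', 'CA'], 'score': [90, 8]})

def describe_row(row):
    # With axis=1, apply passes each row as a pandas Series, not the DataFrame.
    return type(row).__name__

def two_returns(row):
    return 'first'
    return 'second'  # never reached: the function exits at the first return

row_types = df.apply(describe_row, axis=1).tolist()
results = df.apply(two_returns, axis=1).tolist()

print(row_types)
print(results)
```

Because only the first value ever comes back, the original function returns a single string per row, which cannot fill two columns.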

score 2 · Accepted Answer · answered Jun 12 '18 at 22:40

What you are trying to do does not fit the pandas philosophy well. Also, apply with axis=1 is slow because it calls your function once per row. You should probably use NumPy's where instead:

import numpy as np
df['comment'] = np.where(df['address'] == 'NY',
                  'call tomorrow', 'Dont call')
df['selection_status'] = np.where(df['address'] == 'NY',
                           'interview scheduled', 'application rejected')

Or boolean indexing:

df.loc[df['address'] == 'NY', ['comment', 'selection_status']] \
         = 'call tomorrow', 'interview scheduled'
df.loc[df['address'] != 'NY', ['comment', 'selection_status']] \
         = 'Dont call', 'application rejected'

answered Jun 12 '18 at 22:40

DYZ


This is what I understand so far: if I need to return more than one column, writing a function is not useful. I have used the df.loc method before, but here I wanted to return both columns at the same time instead of dealing with them separately using np.where or df.loc. But I guess that wasn't the right approach. – singularity2047 Jun 12 '18 at 22:49

@singularity2047, Pandas is based on series arrays (columns). Updating each series individually in a vectorised fashion will usually be faster than updating them together via `pd.DataFrame.apply` (which is just a very inefficient loop). – jpp Jun 12 '18 at 22:57
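
The speed claim in this comment can be tried out with a rough timing sketch. The 10,000-row synthetic frame and the helper names `with_apply` and `with_where` are assumptions made for illustration; the timings depend on the machine and pandas version, so no particular numbers are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data, assumed for illustration: 10,000 rows of addresses.
addresses = ['NY', 'CA', 'NJ', 'OR'] * 2500
df = pd.DataFrame({'address': addresses})

def with_apply(frame):
    # Calls the lambda once per row (a Python-level loop).
    return frame.apply(
        lambda row: 'call tomorrow' if row['address'] == 'NY' else 'Dont call',
        axis=1,
    )

def with_where(frame):
    # One vectorised comparison over the whole column.
    return np.where(frame['address'] == 'NY', 'call tomorrow', 'Dont call')

# Both approaches should produce the same values.
same_result = with_apply(df).tolist() == with_where(df).tolist()

apply_seconds = timeit.timeit(lambda: with_apply(df), number=3)
where_seconds = timeit.timeit(lambda: with_where(df), number=3)

print('identical results:', same_result)
print(f'apply:    {apply_seconds:.4f} s for 3 runs')
print(f'np.where: {where_seconds:.4f} s for 3 runs')
```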

score 2 · Answer 2 · answered Jun 12 '18 at 22:45

You should use numpy.where as per DYZ's solution. A principal benefit of Pandas is vectorised computation. However, below I'll show how you would use pd.DataFrame.apply. Points to note:

With axis=1, your function receives one row at a time, not the entire dataframe in one go. Therefore, you should name the argument accordingly.
Two return statements in a row will not work. A function stops when it reaches the first return.
Instead, return a list of results, then use pd.Series.values.tolist to convert the resulting series of lists into a list of lists that can be assigned to both columns.

Here's a working example.

def get_comment_status(row):
    if row['address'] == 'NY':
        return ['call tomorrow', 'interview scheduled']
    else:
        return ['Dont call', 'application rejected']

df[['comment', 'selection_status']] = df.apply(get_comment_status, axis=1).values.tolist()

print(df)

  address  name1  name2        comment  score      selection_status
0      NY   john  mayer  call tomorrow     90   interview scheduled
1      CA  mayer  dylan      Dont call      8  application rejected
2      NJ  dylan  mayer      Dont call     88  application rejected
3      NY    bob    bob  call tomorrow     72   interview scheduled
4      WS   mary    bob      Dont call     34  application rejected
5      OR   jake    tim      Dont call     95  application rejected
6      OR    rob    ben      Dont call     50  application rejected
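
A possible variation, offered as a sketch and not as part of the answer above: if the function returns a labelled `pd.Series` instead of a list, `apply` builds a DataFrame whose columns are the Series labels, so the `.values.tolist()` step is not needed. The three-row frame and the `updates` variable are assumptions for illustration. Building a Series per row adds overhead, so this is likely slower than the list version.

```python
import pandas as pd

# Cut-down version of the sample data (an assumption for illustration).
df = pd.DataFrame({
    'address': ['NY', 'CA', 'NY'],
    'comment': ['n/a', 'n/a', 'n/a'],
    'selection_status': ['inprogress', 'inprogress', 'inprogress'],
})

def get_comment_status(row):
    # Returning a labelled Series makes apply build a DataFrame
    # whose columns are the Series index labels.
    if row['address'] == 'NY':
        values = ['call tomorrow', 'interview scheduled']
    else:
        values = ['Dont call', 'application rejected']
    return pd.Series(values, index=['comment', 'selection_status'])

updates = df.apply(get_comment_status, axis=1)
df[['comment', 'selection_status']] = updates

print(df)
```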

This is immensely helpful for me. Although I will lean towards np.where() from now on, I'd still like to learn different ways of doing the same thing. – singularity2047, Jun 12 '18 at 23:02
