2

I have a dataframe and a function that I want to apply on multiple columns via pandas.apply. Currently I am doing so using a for loop, which I would like to replace with a single line of code.

This is my dataframe:

d1 = {'id': [1, 1, 2], 'event': ['e', 'c', 'e'], 'var1': [1, 2, 2], 'time_difference': [0, 5, 2]}
df1 = pd.DataFrame(data=d1)
...
>> df1 
   id event  var1  time_difference
0   1     e     1                0
1   1     c     2                5
2   2     e     2                2

This is the function that I want to apply:

def merge_based_on_timelimit(row):
    return row[column_of_interest] if row['time_difference'] <= 1\
        else pd.NA

These are my columns of interest (on which I want to apply the function on):

columns_of_interest = ['event', 'var1']

Currently I am applying my function on all columns of interest via a for loop:

for column_of_interest in columns_of_interest:
    df1[column_of_interest] = df1.apply(merge_based_on_timelimit, axis=1)

However, I am looking for a way to skip the loop and to instead apply my function directly onto all columns of interest. How can I do this? So far I have tried the following:

df1[columns_of_interest] = df1[columns_of_interest].apply(merge_based_on_timelimit, axis=1)

which returns me the following error:

...
redcap[columns_of_interest] = redcap[columns_of_interest].apply(merge_based_on_timelimit, axis=1)
...
KeyError: 'time_difference'
wibeasley
  • 5,000
  • 3
  • 34
  • 62

1 Answers1

1

In my opinion here apply is not necessary, use DataFrame.loc for set values by inverted mask by > 1:

df1.loc[df1['time_difference'] > 1, columns_of_interest] = pd.NA

print (df1)
   id event  var1  time_difference
0   1     e     1                0
1   1  <NA>  <NA>                5
2   2  <NA>  <NA>                2

Your solution is possible chnage by:

def merge_based_on_timelimit(row):
    #added s to column_of_interest
    return row[columns_of_interest] if row['time_difference'] <= 1\
        else pd.NA
columns_of_interest = ['event', 'var1']

#added column time_difference to list
df1[columns_of_interest] = df1[columns_of_interest + ['time_difference']].apply(merge_based_on_timelimit, axis=1)

print (df1)
   id event  var1  time_difference
0   1     e     1                0
1   1  <NA>  <NA>                5
2   2  <NA>  <NA>                2
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252