0

I want to create a new column in a dataframe based on if/then logic. The rules for the actual problem are the output of a CART tree, so they are fairly complex. When I try to apply the function to my dataframe, I get this error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I am pretty sure that this is because the 'if' logic is trying to evaluate the input as a series as opposed to on a row by row basis. I just can't figure out the solution.

To replicate:

import pandas as pd
import numpy as np
np.random.seed(1)

#create sample dataframe
df_test = pd.DataFrame({"llflag": np.random.normal(0,1,100)})

#sample if/else logic
def tree1(df):
    if df['llflag'] <= 0.5:
        return 4
    else:
        return 3

#attempt to apply function to df
df_test['testRR'] = df_test.apply(tree1(df_test ), axis = 1)

I got the same result with:

df_test['testRR'] = df_test.apply(lambda x: tree1(df_test), axis = 1)

What am I missing? Thanks in advance.

29Clyde

2 Answers

3

You want apply to call the function once per row, not to receive the result of calling the function on df_test (that call is what fails). Pass the function itself, without the parentheses and argument:

df_test['testRR'] = df_test.apply(tree1, axis = 1)
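To see why the parentheses matter, here is a minimal sketch. The three-row frame df_demo is a made-up example, not the data from the question. Calling tree1(df_demo) runs the if on a whole column and raises the ValueError, whereas passing tree1 itself lets apply call it once per row, where row['llflag'] is a single number.

```python
import pandas as pd

# Hypothetical three-row frame, for illustration only
df_demo = pd.DataFrame({"llflag": [-1.0, 0.5, 2.0]})

def tree1(row):
    # With axis=1, apply passes one row at a time as a Series,
    # so row['llflag'] is a scalar and the comparison is unambiguous
    if row['llflag'] <= 0.5:
        return 4
    else:
        return 3

# Calling the function on the whole frame puts a boolean Series inside `if`
try:
    tree1(df_demo)
    raised = False
except ValueError:
    raised = True

# Passing the function object lets apply call it once per row
result = df_demo.apply(tree1, axis=1).tolist()

print(raised)
print(result)
```

The lambda version from the question fails for the same reason: it ignores its argument x and still calls tree1 on the full df_test.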

I'd also discourage using apply here, so here's a different, faster version:

df_test['testRR'] = np.where(df_test['llflag'] <= 0.5, 4, 3)

Or a list comprehension version (also faster than apply):

def tree2(value):
    return 4 if value <= 0.5 else 3

df_test['testRR'] = [tree2(row) for row in df_test["llflag"]]
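If the tree has more than one split, the np.where idea above can be extended with np.select, which takes a list of conditions and a matching list of values. The following is only a sketch under assumptions: the second column x2, its threshold 2.0, and the leaf value 5 are invented for illustration and are not part of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'x2' is a made-up second feature
df_demo = pd.DataFrame({
    "llflag": [-1.0, 0.2, 0.2, 2.0],
    "x2": [0.0, 1.0, 5.0, 0.0],
})

# One condition per branch; np.select uses the first condition that is True
conditions = [
    (df_demo["llflag"] <= 0.5) & (df_demo["x2"] <= 2.0),
    (df_demo["llflag"] <= 0.5),
]
leaves = [4, 5]

# Rows matching no condition get the default leaf value
df_demo["testRR"] = np.select(conditions, leaves, default=3)

print(df_demo["testRR"].tolist())
```

Each condition is a boolean Series combined with & rather than and, which avoids the ambiguous truth value error entirely.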
Tom
  • Thanks for the suggestion to use 'where'. That is what I typically use, but in this case I am trying to apply the output of a classification tree model that has 9 feature inputs. The combined 'if' logic has more than 600 comparisons, so I don't think it is feasible to implement with 'where'. I suspect there is a better way to apply a CART decision tree to a dataframe, but I will need to research that more when I have time. Thanks again. – 29Clyde Jun 25 '20 at 15:15
1

Remove the (df_test):

df_test['testRR'] = df_test.apply(tree1, axis = 1)

This will apply the function to each row.