pandas apply custom functions to rows based on condition

Question

I have a pandas df and a bunch of custom functions written to do data checks on survey data. We have a number of exceptions where certain data checks should or should not be done - these are based off a categorical variable or a date variable. When doing something like this:

def data_check(df):
    if df[string_col] == 'some string':
        df = package.f1(df, other_col1)
    df = package.f2(df, other_col1, other_col2)
    if df[date_col] > some_datetime_obj:
        df = package.f3(df, other_col3)
    return df

clean_df = data_check(dirty_df)

I get this error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Thanks!!

You are comparing a series `df[string_col]` to a string 'some string'. Therefore, the output will be a series. As the error message says, you need to be specific about what you are testing, e.g. are all or any of the Boolean series output meant to be True? — jpp, Jan 26 '18 at 21:16

score 0 · Answer 1 · answered Jan 26 '18 at 21:16

`df[string_col]` is a whole column (a pandas Series), but 'some string' is a single string. When you write `df[string_col] == 'some string'`, the comparison is broadcast over the column, so you get a Series with one boolean per row. The `if` statement expects a single boolean, and pandas raises the error instead of guessing how to collapse the Series into one.
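Here is a minimal sketch of the two usual ways around this. The data, the column names (`survey_type`, `interview_date`, `score`) and the `flag_low_score` function are all made up for illustration; `flag_low_score` only stands in for something like your `package.f1`, whose real signature I don't know.

```python
import pandas as pd

# Hypothetical stand-in data; these column names are assumptions.
df = pd.DataFrame({
    "survey_type": ["phone", "web", "phone"],
    "interview_date": pd.to_datetime(["2018-01-05", "2018-01-20", "2018-02-01"]),
    "score": [1, 2, 3],
})

# The comparison is elementwise: one boolean per row.
mask = df["survey_type"] == "phone"

# Option 1: collapse the mask to a single boolean before using it in an `if`.
any_phone = bool(mask.any())  # at least one row matches
all_phone = bool(mask.all())  # every row matches

# Option 2: use the mask to run a check only on the matching rows.
def flag_low_score(rows):
    """Hypothetical check standing in for package.f1."""
    return rows["score"] < 2

df["flagged"] = False
df.loc[mask, "flagged"] = flag_low_score(df.loc[mask])

# The date condition works the same way: compare, then use the mask.
late = df["interview_date"] > pd.Timestamp("2018-01-15")
n_late = int(late.sum())  # number of rows after the cutoff
```

Option 1 fits if the check should run on the whole frame whenever any (or every) row meets the condition. Option 2 fits if the check should only touch the rows that meet it, which sounds closer to what your exceptions describe.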

Also, when you indent, what you write is automatically interpreted as code. You don't need to both indent and enclose in grave marks. (Grave marks are these things: `)
