Intro and reproducible code snippet
I'm having a hard time performing an operation on a few columns that requires checking a condition with an if/else statement.
More specifically, I'm trying to perform this check within the confines of the assign method of a pandas DataFrame. Here is an example of what I'm trying to do:
# Importing Pandas
import pandas as pd
# Creating synthetic data
my_df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                      'col2': [11, 22, 33, 44, 55, 66, 77, 88, 99, 1010]})
# Creating a separate output DataFrame that doesn't overwrite
# the original input DataFrame
out_df = my_df.assign(
    # Successfully creating a new column called `col3` using a lambda function
    col3=lambda row: row['col1'] + row['col2'],
    # Using a new lambda function to perform an operation on the newly
    # generated column.
    bleep_bloop=lambda row: 'bleep' if (row['col3'] % 8 == 0) else 'bloop')
The code above yields a ValueError:
ValueError: The truth value of a Series is ambiguous
When trying to investigate the error, I found this SO thread. It seems that lambda functions don't always play nicely with conditional logic in a DataFrame, mostly because the DataFrame tries to handle things as Series.
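As far as I can tell, the root cause is that assign calls the lambda once with the whole DataFrame rather than once per row, so the condition becomes a boolean Series instead of a single True/False. Here is a minimal sketch that checks this; inspect_arg and seen_types are hypothetical names I made up for the illustration, not part of pandas:

```python
import pandas as pd

my_df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                      'col2': [11, 22, 33, 44, 55, 66, 77, 88, 99, 1010]})

# Hypothetical helper: records the type of whatever assign passes in,
# then returns an existing column so the assign call itself succeeds.
seen_types = []

def inspect_arg(arg):
    seen_types.append(type(arg).__name__)
    return arg['col1']

my_df.assign(probe=inspect_arg)
print(seen_types)  # ['DataFrame']

# The condition is therefore evaluated on whole columns and is a
# boolean Series with one value per row.
condition = (my_df['col1'] + my_df['col2']) % 8 == 0
print(type(condition).__name__)  # Series

# Putting that Series into a plain `if` is what raises the ValueError.
caught = None
try:
    'bleep' if condition else 'bloop'
except ValueError as err:
    caught = err
    print(err)
```

So the argument I named `row` in my lambdas is really the full DataFrame, which is why the first lambda (plain column arithmetic) works and the second (an if/else) does not.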
A few dirty workarounds
Use apply
A dirty workaround would be to make col3 using the assign method as indicated above, but then create the bleep_bloop column using an apply method instead:
out_sr = (my_df.assign(
              col3=lambda row: row['col1'] + row['col2'])
          .apply(lambda row: 'bleep' if (row['col3'] % 8 == 0)
                 else 'bloop', axis=1))
The problem here is that the code above returns only a Series with the results of the bleep_bloop column instead of a new DataFrame with both col3 and bleep_bloop.
On the fly vs. multiple commands
Yet another approach would be to break one command into two:
out_df_2 = (my_df.assign(col3=lambda row: row['col1'] + row['col2']))
out_df_2['bleep_bloop'] = out_df_2.apply(lambda row: 'bleep' if (row['col3'] % 8 == 0)
                                         else 'bloop', axis=1)
This also works, but I'd really like to stick to the on-the-fly approach where I do everything in one chained command, if possible.
Back to the main question
Given that the workarounds I showed above are messy and don't really do what I need, is there any other way to create a new column based on a conditional if/else statement?
The example I gave here is pretty simple, but consider that the real-world application would likely involve applying custom-made functions (e.g.: out_df=my_df.assign(new_col=lambda row: my_func(row)), where my_func is some complex function that uses several other columns from the same row as inputs).
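To make the real-world case concrete, here is a minimal sketch of the shape I have in mind. The body of my_func is a hypothetical stand-in for the complex function, and nesting apply(axis=1) inside the assign callable is only one possible way to keep everything in a single chain; I'm assuming it is acceptable performance-wise, which may not hold for large frames:

```python
import pandas as pd

my_df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                      'col2': [11, 22, 33, 44, 55, 66, 77, 88, 99, 1010]})

# Hypothetical stand-in for a complex function that reads several
# columns of a single row and returns one value.
def my_func(row):
    return 'bleep' if row['col3'] % 8 == 0 else 'bloop'

out_df = my_df.assign(
    # `df` is the whole DataFrame, so this is plain column arithmetic
    col3=lambda df: df['col1'] + df['col2'],
    # `df` already contains col3 here; apply(axis=1) then passes
    # my_func one row at a time, where if/else on scalars is fine
    new_col=lambda df: df.apply(my_func, axis=1))

print(out_df)
```

The input my_df is left untouched, and out_df carries both col3 and new_col.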