I am doing some operations on a pandas dataframe, specifically:
- Dropping a column
- Using the dataframe.apply() function to add a column based on an existing one
Here's the simplest test-case I've been able to create:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["Fred", 1, 44],
     ["Wilma", 0, 39],
     ["Barney", 1, None]],
    columns=["Name", "IntegerColumn", "Age"])

def translate_age(series):
    # Label each row by age; a missing age (NaN) becomes "Unknown"
    if not np.isnan(series['Age']):
        series["AgeText"] = "Over 40" if series["Age"] > 40 else "Under 40"
    else:
        series["AgeText"] = "Unknown"
    return series

df = df.drop('Name', axis=1)
print('@ before', df['IntegerColumn'].dtypes)
df = df.apply(func=translate_age, axis=1)
print('@ after', df['IntegerColumn'].dtypes)
The print() output shows the change in the IntegerColumn's type. It started as an integer:
@ before int64
... and then, after the apply() call, it changes to a float:
@ after float64
Initially, the dataframe looks like this:
     Name  IntegerColumn   Age
0    Fred              1  44.0
1   Wilma              0  39.0
2  Barney              1   NaN
... and after the apply() call, it looks like this:
   IntegerColumn   Age   AgeText
0            1.0  44.0   Over 40
1            0.0  39.0  Under 40
2            1.0   NaN   Unknown
Why is the IntegerColumn changing from an integer to a float in this case? And how can I stop it from doing so?
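A diagnostic that may help narrow this down, offered as a minimal sketch rather than a confirmed explanation: once Name is dropped, the remaining columns are int64 and float64, and a single row pulled out as a Series has only one dtype. The snippet below checks that dtype, then builds AgeText from the Age column alone, so no row is ever passed through a function. The helper name age_to_text is hypothetical and not part of the code above.

```python
import pandas as pd

# Same data as the test case, with the Name column already dropped
df = pd.DataFrame(
    [[1, 44], [0, 39], [1, None]],
    columns=["IntegerColumn", "Age"])

# A row taken from an int64 + float64 frame is a single-dtype Series
row_dtype = df.iloc[0].dtype
print('@ row dtype', row_dtype)

def age_to_text(age):
    # Hypothetical helper: works on one Age value instead of a whole row
    if pd.isna(age):
        return "Unknown"
    return "Over 40" if age > 40 else "Under 40"

# Only the Age column goes through apply(); IntegerColumn is never touched
df["AgeText"] = df["Age"].apply(age_to_text)
print('@ after', df['IntegerColumn'].dtypes)
```

If the row dtype printed here is float64, that would suggest the integers are upcast when each row is handed to translate_age under axis=1, and that applying the function to the Age column only is one way to leave IntegerColumn untouched.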