
I am doing some operations on a pandas dataframe, specifically:

  • Dropping a column
  • Using the dataframe.apply() function to add a column based on an existing one

Here's the simplest test-case I've been able to create:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["Fred", 1, 44],
     ["Wilma", 0, 39],
     ["Barney", 1, None]],
    columns=["Name", "IntegerColumn", "Age" ])

def translate_age(series):
    if not np.isnan(series['Age']):    
        series["AgeText"] = "Over 40" if series["Age"] > 40 else "Under 40"
    else:
        series["AgeText"]  = "Unknown"
    return series
    
df = df.drop('Name', axis=1)
print('@ before', df['IntegerColumn'].dtypes)
df = df.apply(func=translate_age, axis=1)
print('@ after', df['IntegerColumn'].dtypes)

The print() output shows the change in the IntegerColumn's type. It started as an integer:

@ before int64

... and then after the apply() call, it changes to a float:

@ after float64

Initially, the dataframe looks like this:

     Name  IntegerColumn   Age
0    Fred              1  44.0
1   Wilma              0  39.0
2  Barney              1   NaN

... after the apply() call, it looks like this:

   IntegerColumn   Age   AgeText
0            1.0  44.0   Over 40
1            0.0  39.0  Under 40
2            1.0   NaN   Unknown

Why is the IntegerColumn changing from an integer to a float in this case? And how can I stop it from doing so?

wjandrea
antun
  • I just edited to put the actual text instead of pictures. See [Why should I not upload images of code/data/errors?](https://meta.stackoverflow.com/q/285551/4518341) – wjandrea Mar 05 '23 at 03:40
  • I thought it was overkill to put the pictures in the post. Can I ask what you used to get the simple text printout of the dataframe? I searched and found some solutions for prettyprinting, but they rely on extra libraries: https://stackoverflow.com/questions/18528533/pretty-printing-a-pandas-dataframe – antun Mar 05 '23 at 17:03
  • You can just use `print(df)`, that's all. Or if you want to customize the output, use `df.to_string()`. Although, I'm using VSCode Jupyter which has a "Change Presentation" option that lets you toggle plaintext and HTML when you just write `df`. – wjandrea Mar 05 '23 at 18:34
  • Ah, that makes sense. I was so used to the default behavior of notebooks that print out whatever the last thing you return, so I never even tried printing a dataframe. – antun Mar 05 '23 at 22:45

1 Answer

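With axis=1, apply() passes each row to your function as a Series, and a Series has a single dtype. Once the string column is dropped, the remaining columns are int64 and float64, so each row is upcast to their common dtype, float64, and the result is built from those float rows. If you hadn't dropped the string column, there would be no common numeric dtype, so the conversion wouldn't happen.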
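To see the upcast directly, you can look at the dtype of a single row with and without the string column. This is a minimal sketch that reuses the DataFrame from the question; the variable names `row_with_name` and `row_numeric` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [["Fred", 1, 44],
     ["Wilma", 0, 39],
     ["Barney", 1, None]],
    columns=["Name", "IntegerColumn", "Age"])

# With the string column present, a row can only be held as object dtype,
# so the integer value is not converted to a float.
row_with_name = df.iloc[0]
print(row_with_name.dtype)

# After dropping it, int64 and float64 share a common numeric dtype,
# so the whole row becomes float64.
row_numeric = df.drop("Name", axis=1).iloc[0]
print(row_numeric.dtype)
```

The row-wise apply() hands your function the same kind of row Series, which is why IntegerColumn comes back as a float.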

What you're doing, mutating the row that is passed to your function, is recommended against in the docs for DataFrame.apply():

Notes

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods for more details.

Instead, assign the whole column at once, for example like this:

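import numpy as np

def translate_age(age):
    # Takes a single Age value rather than a whole row
    if np.isnan(age):
        return "Unknown"
    return "Over 40" if age > 40 else "Under 40"

# Series.apply only touches the Age column, so the other columns keep their dtypes
df['AgeText'] = df['Age'].apply(translate_age)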
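If you'd rather avoid apply() altogether, the same mapping can probably be written in a vectorized way. The sketch below uses `np.select`, which is a swapped-in technique and not part of the answer above; the `conditions` and `choices` names are just illustrative. It also checks that IntegerColumn keeps its integer dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["Fred", 1, 44],
     ["Wilma", 0, 39],
     ["Barney", 1, None]],
    columns=["Name", "IntegerColumn", "Age"]).drop("Name", axis=1)

# Conditions are checked in order; rows matching none of them get the default.
conditions = [df["Age"].isna(), df["Age"] > 40]
choices = ["Unknown", "Over 40"]
df["AgeText"] = np.select(conditions, choices, default="Under 40")

# Only a new column was assigned, so the existing columns are untouched.
print(df["IntegerColumn"].dtype)
print(df)
```

As in the original function, an age of exactly 40 falls into "Under 40" here.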
wjandrea