Strange problem.

I have a dtype == object dataframe column with string values and NaNs. Looks like this:

df   
     Response    
0    Email
1    NaN
2    NaN
3    Call
4    Email
5    Email

I want to use fillna to fill the NaN values with the most frequently occurring value, which in this case is 'Email'.

The code looks like this:

import numpy as np
import pandas as pd

most_frequent_cat = str(df['Response'].mode())
df['Response_imputed'] = df['Response']
df['Response_imputed'].fillna(most_frequent_cat, inplace=True)

The results look like this:

df   Response    

0    Email
1    0    Email\ndtype: object
2    0    Email\ndtype: object
3    Call
4    Email
5    Email

`0    Email\ndtype: object` is not the same as `Email`.

If I remove the `str`, the original NaNs are not replaced at all.

What am I doing wrong?

cs95
Windstorm1981

2 Answers

1

Don't use DataFrame.fillna with inplace=True. Actually, I would recommend forgetting that argument exists entirely. Use Series.fillna instead, since you only need this on one column, and assign the result back.

Another thing to note is mode can return multiple modes if there is no single mode. In that case it should suffice to either select the first one, or one at random (an exercise for you).
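As a rough illustration of the tie case described above, here is a minimal sketch. The `tied` Series is made-up data, not the asker's; it only assumes that `Series.mode()` returns every tied value and that picking the first one is acceptable.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a tie: 'Email' and 'Call' each appear twice.
tied = pd.Series(['Email', 'Call', np.nan, 'Call', 'Email'])

# mode() returns a Series containing every mode (NaN is ignored by default).
modes = tied.mode()

# Pick a single scalar to fill with; here simply the first mode.
first_mode = modes.iat[0]

# Fill the missing value with that scalar and keep the result.
filled = tied.fillna(first_mode)
print(modes)
print(filled)
```

Taking the first mode is just one possible tie-break; choosing one at random would work equally well if the choice does not matter.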

Here's my recommended syntax:

# call fillna on the column and assign it back
df['Response'] = df['Response'].fillna(df['Response'].mode().iat[0])
df
 
  Response
0    Email
1    Email
2    Email
3     Call
4    Email
5    Email

You can also do a per-column fill if you have multiple columns with NaNs to fill. The procedure is similar: call mode on the DataFrame, take the first mode for each column, and pass the result to DataFrame.fillna:

df.fillna(df.mode().iloc[0])

  Response
0    Email
1    Email
2    Email
3     Call
4    Email
5    Email
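To make the per-column version a bit more concrete, here is a sketch with two columns. The `frame` data and the `Channel` column are invented for illustration and are not part of the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame; 'Channel' is an invented column.
frame = pd.DataFrame({
    'Response': ['Email', np.nan, 'Call', 'Email'],
    'Channel': ['Web', 'Web', np.nan, 'App'],
})

# mode() gives the mode(s) of each column; row 0 holds the first mode per column.
first_modes = frame.mode().iloc[0]

# Passing a Series to DataFrame.fillna fills each column with the value
# stored under that column's name.
filled_frame = frame.fillna(first_modes)
print(filled_frame)
```

Because `first_modes` is indexed by column name, each column receives its own fill value rather than one value shared across the frame.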
cs95
  • That worked! I added the `.iat[0]`. Can you explain more thoroughly why that fixes the issue in your answer? I would never have figured that out and didn't see a solution anywhere I looked. Thanks. – Windstorm1981 Dec 16 '20 at 22:57
  • @Windstorm1981 I think in your case the issue was the `str(df['Response'])` which didn't make sense to me. Why are you converting the column to a string? – cs95 Dec 16 '20 at 23:00
  • Without conversion to a string `fillna()` didn't work at all - just left the NaNs there. I thought it might be a type problem. So does it seem like the problem is that I have more than one mode, e.g. maybe the same number of `Email` and `Call`? – Windstorm1981 Dec 16 '20 at 23:03
  • @Windstorm1981 I don't see that issue on your sample data but it's likely there in your actual data. – cs95 Dec 16 '20 at 23:03
  • What is `.iat[]`? Never saw that before. – Windstorm1981 Dec 16 '20 at 23:04
  • @Windstorm1981 it gets a scalar value from a cell. See [the docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iat.html). There is also `.at`, the difference is that `iat` requires position, `.at` requires label. However if your DataFrame possesses an integer index they are interchangeable. – cs95 Dec 16 '20 at 23:13
  • Is `iat[]` new? I've seen `loc`, `iloc`, `ix`. never `iat[]` – Windstorm1981 Dec 16 '20 at 23:17
  • @Windstorm1981 It's been around since pre version 1 days (possibly way back from v0.16 even). `ix` is deprecated/removed now. – cs95 Dec 16 '20 at 23:18
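For the `.iat` versus `.at` distinction raised in the comments, a small sketch may help. The string labels `'a'`, `'b'`, `'c'` are made up so that position and label visibly differ.

```python
import pandas as pd

# Hypothetical Series with string labels instead of the default 0..n-1 index.
labelled = pd.Series(['Email', 'Call', 'Email'], index=['a', 'b', 'c'])

by_position = labelled.iat[1]   # second element, regardless of its label
by_label = labelled.at['b']     # element stored under the label 'b'

print(by_position, by_label)
```

Both accessors return a single scalar; with the default integer index, position and label happen to coincide.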
1
import numpy as np
import pandas as pd
d = {'Response': ['Email', np.nan, np.nan, 'Call', 'Email', 'Email']}
df = pd.DataFrame(data=d)

df['Response'].mode() 

output:

0    Email
dtype: object

Take the first element:

df['Response'].mode()[0] 

output:

'Email'
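Since `mode()` returns a Series rather than a string, this likely also explains both symptoms in the question. The sketch below rebuilds the sample data under the assumption that the missing entries are real NaN values; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data from the question, assuming the gaps are genuine NaN values.
responses = pd.Series(['Email', np.nan, np.nan, 'Call', 'Email', 'Email'])

modes = responses.mode()        # a one-element Series, not a string

# str() on a Series gives its printed form, including the index and dtype line.
as_text = str(modes)

# Filling with a Series aligns on index labels: 'modes' only has label 0,
# so the NaNs at labels 1 and 2 find no match and stay missing.
unfilled = responses.fillna(modes)

# Filling with the scalar replaces every NaN.
filled = responses.fillna(modes[0])

print(repr(as_text))
print(unfilled.isna().sum(), filled.isna().sum())
```

In other words, wrapping the result in `str` fills the gaps with the printed Series text, while passing the Series itself fills nothing because of index alignment.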
roadrunner66