2

I've been trying to replace missing values in a Pandas dataframe, but without success. I tried the .fillna method and also tried to loop through the entire data set, checking each cell and replacing NaNs with a chosen value. However, in both cases, Python executes the script without throwing up any errors, but the NaN values remain.

When I dug a bit deeper, I discovered behaviour that seems erratic to me, best demonstrated with an example:

In[ ] X['Smokinginpregnancy'].head() 

Out[ ] 

Index
E09000002          NaN
E09000003     5.216126
E09000004    10.287496
E09000005     3.090379
E09000006     6.080041
Name: Smokinginpregnancy, dtype: float64

I know for a fact that the first item in this column is missing and pandas recognises it as NaN. In fact, if I call this item on its own, python tells me it's NaN:

In [ ] X['Smokinginpregnancy'][0]
Out [ ]
nan

However, when I test whether it's NaN, python returns False.

In [ ] X['Smokinginpregnancy'][0] == np.nan
Out [ ] False

I suspect that when .fillna is being executed, python checks whether the item is NaN but gets back a False, so it continues, leaving the cell alone.

Does anyone know what's going on? Any solutions? (apart from opening the csv file in excel and then manually replacing the values.)

I'm using Anaconda's Python 3 distribution.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
user3353185
  • 127
  • 1
  • 12

2 Answers2

2

You are doing:

X['Smokinginpregnancy'][0] == np.nan

This is guaranteed to return False because all NaNs compare unequal to everything by IEEE754 standard:

>>> x = float('nan')
>>> x == x
False
>>> x == 1
False
>>> x == float('nan')
False

See also here. You have to use math.isnan to check for NaNs:

>>> math.isnan(x)
True

Or numpy.isnan

So use:

numpy.isnan(X['Smokinginpregnancy'][0])

Regarding pandas.fillna note that this function returns the filled array. Maybe you did something like:

X.fillna(...)

without reassigning X? Alternatively you must pass inplace=True to mutate the dataframe on which you are calling the method.

Community
  • 1
  • 1
Bakuriu
  • 98,325
  • 22
  • 197
  • 231
1

NaN in pandas can be check function pandas.isnull. I created boolean mask and return subset with NaN values.

Function filnna can be used for one column Smokinginpregnancy (more info in doc):

X['Smokinginpregnancy'] = X['Smokinginpregnancy'].fillna('100')

or

X['Smokinginpregnancy'].fillna('100', inplace=True)

Warning:
Sometimes inplace=True can be ignored, better is not use. - link, github, github 3 comments.

All together:

print X['Smokinginpregnancy'].head() 
#Index
#E09000002          NaN
#E09000003     5.216126
#E09000004    10.287496
#E09000005     3.090379
#E09000006     6.080041

#check NaN in column Smokinginpregnancy by boolean mask
mask = pd.isnull(X['Smokinginpregnancy'])
XNaN = X[mask]
print XNaN
#           Smokinginpregnancy
#Index
#E09000002                 NaN

#use function fillna for column Smokinginpregnancy
#X['Smokinginpregnancy'] = X['Smokinginpregnancy'].fillna('100')
X['Smokinginpregnancy'].fillna('100', inplace=True)
print X
#          Smokinginpregnancy
#Index
#E09000002                100
#E09000003           5.216126
#E09000004            10.2875
#E09000005           3.090379
#E09000006           6.080041

More information, why comparison doesn't work:

One has to be mindful that in python (and numpy), the nan's don’t compare equal, but None's do. Note that Pandas/numpy uses the fact that np.nan != np.nan, and treats None like np.nan. More info in Bakuriu's answer.

In [11]: None == None
Out[11]: True

In [12]: np.nan == np.nan
Out[12]: False
Community
  • 1
  • 1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • While this technically shows the OP what to do, it neither explains that the "erratic" behavior is in fact normal and expected (and why), nor why .fillna didn't work in the first place. – SethMMorton Nov 07 '15 at 15:42