
I was trying to replace the np.nan values in my DataFrame with None and noticed that, in the process, the dtype of the float columns changed to object, even for columns that contain no missing data.

As an example:

import pandas as pd
import numpy as np
data = pd.DataFrame({'A':np.nan,'B':1.096, 'C':1}, index=[0])
data.replace(to_replace={np.nan:None}, inplace=True)

Calling data.dtypes before and after the call to replace shows that the dtype of column B changed from float to object, whereas column C stayed int. If I remove column A from the original data, this does not happen. Why does the dtype change, and how can I avoid this effect?

Chris
  • @yatu: Why the instant close? The linked answer says nothing about why otherwise unrelated columns see change in their `dtype`; the behavior in OP does not appear if `A` is dropped prior to the replacement. – fuglede Dec 27 '19 at 12:30
  • Yes, you're right, reopened @fuglede – yatu Dec 27 '19 at 12:30
  • Looks like a bug - could you report it here? https://github.com/pandas-dev/pandas/issues – ignoring_gravity Dec 27 '19 at 12:32
  • Looks buggy to me. Can't see why replacing `NaNs` should also affect float columns with no missing values. I'd suggest reporting it as @ignoring_gravity suggests if you cannot find related issues – yatu Dec 27 '19 at 12:35
  • Pure speculation, but I assume that None is treated as a string purely _because_ the `np.nan` value exists. There is no clear definition of `None` in a string column or a numeric column, so it's treated as an object by default. – Umar.H Dec 27 '19 at 12:37
  • This appears to already be a known issue https://github.com/pandas-dev/pandas/issues/23305 – CumminUp07 Dec 27 '19 at 13:22

2 Answers


I've come across this many times, and there is a fix: precede the call to replace with astype(object), and it will preserve the dtypes. I've had to use this for merge issues, combine issues, etc. I'm not sure why it preserves the types when used this way, but it does, and it's useful once you find out about it.

data.info()    

#<class 'pandas.core.frame.DataFrame'>
#Int64Index: 1 entries, 0 to 0
#Data columns (total 3 columns):
#A    0 non-null float64
#B    1 non-null float64
#C    1 non-null int64
#dtypes: float64(2), int64(1)
#memory usage: 32.0 bytes

import pandas as pd 
import numpy as np 
data = pd.DataFrame({'A':np.nan,'B':1.096, 'C':1}, index=[0]) 
data.replace(to_replace={np.nan:None}, inplace=True)                                                                                                                                 

data.info()   

#<class 'pandas.core.frame.DataFrame'>
#Int64Index: 1 entries, 0 to 0
#Data columns (total 3 columns):
#A    0 non-null object
#B    1 non-null object
#C    1 non-null int64
#dtypes: int64(1), object(2)
#memory usage: 32.0+ bytes

import pandas as pd 
import numpy as np 
data = pd.DataFrame({'A':np.nan,'B':1.096, 'C':1}, index=[0]) 
data.astype(object).replace(to_replace={np.nan:None}, inplace=True)                                                                                                                  

data.info()                                                                                                                                                                          

#<class 'pandas.core.frame.DataFrame'>
#Int64Index: 1 entries, 0 to 0
#Data columns (total 3 columns):
#A    0 non-null float64
#B    1 non-null float64
#C    1 non-null int64
#dtypes: float64(2), int64(1)
#memory usage: 32.0 bytes
oppressionslayer
  • you're not actually setting data in the second example. You simply called .info on the original df – Stevie May 10 '23 at 23:30
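
To illustrate the point in the comment above: astype returns a new DataFrame, so an in-place replace on that result never reaches data, and data.info() keeps reporting the original dtypes. A minimal sketch of that check follows; the names tmp and before are introduced here purely for illustration and are not part of the answer.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': np.nan, 'B': 1.096, 'C': 1}, index=[0])
before = data.dtypes.tolist()

# astype(object) builds a separate DataFrame; inplace=True mutates only that one
tmp = data.astype(object)
tmp.replace(to_replace={np.nan: None}, inplace=True)

print(tmp is data)                      # False
print(data.dtypes.tolist() == before)   # True: data was never modified
print(data['A'].isna().all())           # True: column A still holds NaN
```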

It works fine when you replace per column, calling replace on each pd.Series(...) rather than on the pd.DataFrame(...).

However, as mentioned in the comments, None cannot be cast to float (or int, or any other numeric type; for those you would use NaN instead), so a column that receives None is automatically cast to object.
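
As a small illustration of that casting rule (a sketch, with the variable names as_float and as_object chosen here for the example): without an explicit dtype pandas stores None in numeric data as NaN, and only an object column keeps a real None.

```python
import numpy as np
import pandas as pd

# With no dtype given, None in numeric data is stored as NaN in a float64 column
as_float = pd.Series([1.096, None])
print(as_float.dtype)              # float64
print(as_float.isna().tolist())    # [False, True]

# Only an object column can hold an actual None
as_object = pd.Series([1.096, None], dtype=object)
print(as_object.dtype)             # object
print(as_object.iloc[1] is None)   # True
```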

import pandas as pd
import numpy as np
data = pd.DataFrame({'A':np.nan,'B':1.096, 'C':1}, index=[0])
print(data)
print(data.dtypes)
for col in data.columns:
    # assign the result back rather than relying on in-place modification of a column selection
    data[col] = data[col].replace(to_replace={np.nan: None})
print(data)
print(data.dtypes)

Output:

    A      B  C
0 NaN  1.096  1

A    float64
B    float64
C      int64
dtype: object
      A      B  C
0  None  1.096  1

A     object
B    float64
C      int64
dtype: object
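
A variation on the same per-column idea, offered as a sketch rather than as the definitive pandas approach: convert only the columns that actually contain NaN, using astype(object) followed by where, so that the other columns are never touched. The helper name nan_to_none is hypothetical, and the sketch assumes unique column names.

```python
import numpy as np
import pandas as pd

def nan_to_none(df):
    """Return a copy of df where NaN becomes None, converting only columns that contain NaN."""
    out = df.copy()
    for col in out.columns:
        mask = out[col].isna()
        if mask.any():
            # None requires object dtype, so only this column is converted
            out[col] = out[col].astype(object).where(~mask, None)
    return out

data = pd.DataFrame({'A': np.nan, 'B': 1.096, 'C': 1}, index=[0])
result = nan_to_none(data)
print(result.dtypes)  # A becomes object; B and C keep float64 and int64
```

Because columns without missing values are skipped entirely, their dtypes cannot change as a side effect of the conversion.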
Grzegorz Skibinski