
I have an int dataframe:

   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

But if I set a value to NaN, the whole column is cast to floats! Apparently int columns can't have NaN values. But why is that?

>>> df.iloc[2,1] = np.nan
>>> df
   0     1   2
0  0   1.0   2
1  3   4.0   5
2  6   NaN   8
3  9  10.0  11
Jeffery
  • Check the output of `type(df.iloc[2,1])` and have a look at this: [why-is-nan-considered-as-a-float](https://stackoverflow.com/questions/48558973/why-is-nan-considered-as-a-float) – Anurag Dabas Jun 24 '21 at 09:21
  • @AnuragDabas is right, `NaN` is a float. You can try `pd.NA` to replace NaN values. – Corralien Jun 24 '21 at 09:35

2 Answers


For performance reasons (which matter a great deal in this case), pandas wants every value in a column to have the same type, and it does its best to keep it that way. NaN is a float value, and all your integers can be harmlessly converted to floats, so that's what happens.
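As a minimal sketch of that upcast (the variable names are illustrative, and the series are built directly rather than by assignment, since recent pandas versions may warn when a value of a different dtype is assigned into an existing column):

```python
import numpy as np
import pandas as pd

# NaN is an ordinary Python/NumPy float, not a special integer value.
nan_is_float = isinstance(np.nan, float)

# A column holding only integers gets an integer dtype.
ints = pd.Series([1, 4, 7, 10])

# The same column with one NaN is stored as floats instead.
with_gap = pd.Series([1, 4, np.nan, 10])

print(nan_is_float)    # True
print(ints.dtype)      # an integer dtype
print(with_gap.dtype)  # float64
```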

If the existing values can't be converted to the type of the new value, pandas falls back to the most general type that makes the assignment work:

>>> x = pd.DataFrame(np.arange(4).reshape(2,2))
>>> x
   0  1
0  0  1
1  2  3
>>> x[1].dtype
dtype('int64')
>>> x.iloc[1, 1] = 'string'
>>> x
   0       1
0  0       1
1  2  string
>>> x[1].dtype
dtype('O')

Since 1 can't be converted to a string in a reasonable manner (without guessing what the user wants), the column is converted to `object`, which is fully general and doesn't allow for any optimizations. That is also what you need if you really want a multi-type column holding both integers and NaN:

>>> x[1] = x[1].astype('O') # Alternatively use a non-float NaN object
>>> x.iloc[1, 1] = np.nan  # or float('nan')
>>> x
   0    1
0  0    1
1  2  NaN

This is generally not recommended unless you really have to.
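One way to see the cost is to compare memory use for the same values stored as integers and as objects. This is only a rough sketch: the names are illustrative and the exact byte counts depend on the platform and the pandas version.

```python
import numpy as np
import pandas as pd

values = np.arange(100_000)

# Same data, once as a native integer column and once as generic objects.
as_int = pd.Series(values)
as_object = as_int.astype(object)

# deep=True also counts the Python objects referenced by the object column.
int_bytes = as_int.memory_usage(index=False, deep=True)
object_bytes = as_object.memory_usage(index=False, deep=True)

print(int_bytes, object_bytes)
```

The object column stores a pointer per element plus a separate Python object for each value, so it should come out noticeably larger.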

kabanus

It's not the best solution, but a visually cleaner option is to use `pd.NA` rather than `np.nan`:

>>> df.iloc[2,1] = pd.NA
>>> df
   0     1   2
0  0     1   2
1  3     4   5
2  6  <NA>   8
3  9    10  11

This looks good, but the column is no longer an integer column:

>>> df.dtypes
0     int64
1    object  # <- not float, but object
2     int64
dtype: object

You can read this page from the documentation.
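The answer above does not cover it, but pandas also provides a nullable integer extension dtype, `"Int64"` (capital I), which can hold `pd.NA` without falling back to `object`. A minimal sketch, assuming the same 4x3 frame as in the question and a pandas version that supports nullable integers:

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question: integers 0..11 in a 4x3 layout.
df = pd.DataFrame(np.arange(12).reshape(4, 3))

# Convert the column to the nullable integer dtype before inserting a missing value.
df[1] = df[1].astype("Int64")
df.iloc[2, 1] = pd.NA

print(df)
print(df.dtypes)
```

Column 1 should stay `Int64` and display the missing entry as `<NA>`, while the other columns keep their original integer dtype.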

Corralien
  • it would be more performant to use floats as opposed to pd.NaN which stores the type as strings, fewer bytes – Umar.H Jun 24 '21 at 09:55