Pandas seems to promote int columns to float when I filter a DataFrame. I've provided a simple snippet below, but I have a much more complex case where I believe this promotion leads to incorrect filtering, because the comparison then happens on floats. Is there a way around this? I read that this behaviour changed between pandas versions; it certainly didn't use to work this way.

As you can see below, the rows [4 13] and [5 14] become [4.0 13.0] and [5.0 14.0].

In [53]: df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 'col2' : [10, 11, 12, 13, 14]})  
    ...: df2 = pd.DataFrame(data = {'col1' : [1, 2, 3], 'col2' : [10, 11, 12]})                                                                                             

In [54]: df1                                                                                                                                                                
Out[54]: 
   col1  col2
0     1    10
1     2    11
2     3    12
3     4    13
4     5    14

In [55]: df2                                                                                                                                                                
Out[55]: 
   col1  col2
0     1    10
1     2    11
2     3    12

In [56]: df1[~df1.isin(df2)]                                                                                                                                                
Out[56]: 
   col1  col2
0   NaN   NaN
1   NaN   NaN
2   NaN   NaN
3   4.0  13.0
4   5.0  14.0

In [57]: df1[~df1.isin(df2)].dropna()                                                                                                                                       
Out[57]: 
   col1  col2
3   4.0  13.0
4   5.0  14.0

In [58]: df1[~df1.isin(df2)].dtypes                                                                                                                                         
Out[58]: 
col1    float64
col2    float64
dtype: object

In [59]: df1.dtypes                                                                                                                                                         
Out[59]: 
col1    int64
col2    int64
dtype: object

In [60]: df2.dtypes                                                                                                                                                         
Out[60]: 
col1    int64
col2    int64
dtype: object
s5s
    It's not because of float comparison; it's because of the `NaN`s. You could use the `Int64` dtype, which supports missing values in integer columns, if you wish. – user3483203 Nov 01 '19 at 15:55

1 Answer


There is no float comparison happening here. isin returns a boolean mask, and indexing df1 with that mask fills every masked-out cell with NaN. Since you are using numpy's int64, which cannot represent NaN, the result is cast to float64.
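To make the cause visible, here is a minimal sketch that separates the two steps, using the frames from the question. The last part is an assumption about intent: if the goal is to drop rows that match df2 in every column, a row-level mask (the names matches_all and kept are illustrative) avoids NaN altogether, so the int64 dtype survives. Note that this is not identical to the cell-wise mask followed by dropna(), which drops a row as soon as any one cell matches.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 11, 12]})

# isin only produces booleans; no values are converted at this point.
mask = ~df1.isin(df2)
print(mask.dtypes)

# Indexing with a boolean DataFrame keeps the shape and writes NaN into
# every cell where the mask is False, which forces int64 up to float64.
masked = df1[mask]
print(masked.dtypes)

# Assumption: the goal is to drop rows that match df2 in every column.
# A row-level mask removes rows instead of blanking cells, so no NaN is needed.
matches_all = df1.isin(df2).all(axis=1)
kept = df1[~matches_all]
print(kept.dtypes)
```

The comparison itself is always done on the original integer values; the float64 columns only appear in the masked result.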

In 0.24, pandas added a nullable integer dtype, which you can use here.


df1 = df1.astype('Int64')
df2 = df2.astype('Int64')

df1[~df1.isin(df2)]

   col1  col2
0   NaN   NaN
1   NaN   NaN
2   NaN   NaN
3     4    13
4     5    14

Just be aware that if you want to use numpy operations on the result, numpy will treat it as an array with dtype object.
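If a numeric numpy array is needed afterwards, one option is to request the dtype explicitly when converting. The sketch below is only an illustration under a few assumptions: the mask is computed on the original int64 frames so that it is a plain boolean frame, it is applied with where() to an Int64 copy, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 11, 12]})

# Compute the mask on the plain int64 frames, then apply it to a
# nullable-integer copy so masked cells become missing values, not floats.
mask = ~df1.isin(df2)
filtered = df1.astype("Int64").where(mask)
col = filtered["col1"]

# Default conversion: inspect the dtype numpy ends up with.
print(col.to_numpy().dtype)

# Explicit conversion: choose how missing values are represented.
as_float = col.to_numpy(dtype="float64", na_value=np.nan)

# Or drop the missing entries first and keep a true integer array.
present = col.dropna().to_numpy(dtype="int64")
print(as_float, present)
```

Which of the two conversions fits depends on whether the positions of the missing rows matter downstream.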

user3483203