3

isin() is giving me weird results. I create the following DataFrame:

import pandas as pd
import numpy as np

test=pd.DataFrame({'1': np.linspace(0.0, 1.0, 11)})

>>> test['1']
0     0.0
1     0.1
2     0.2
3     0.3
4     0.4
5     0.5
6     0.6
7     0.7
8     0.8
9     0.9
10    1.0
Name: 1, dtype: float64

Passing what looks like the same array to isin() then gives unexpected results:

>>> test['1'].isin([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
0      True
1      True
2      True
3     False
4      True
5      True
6     False
7     False
8      True
9      True
10     True
Name: 1, dtype: bool

I suspect a numerical precision problem, or something related to the data type. Can somebody explain this and tell me how to prevent it?

ALollz
Gflaesch

4 Answers

2

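No, isin() is in fact identifying them correctly. The cause is how floating-point numbers are represented in binary ( see here ), not pandas itself, so you need to be careful with exact comparisons. The values stored in the column are not all what the default display suggests: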

print(test["1"].array)
<PandasArray>
[                0.0,                 0.1,                 0.2,
 0.30000000000000004,                 0.4,                 0.5,
  0.6000000000000001,  0.7000000000000001,                 0.8,
                 0.9,                 1.0]
Length: 11, dtype: float64

However, if you pass isin() an array generated the same way, every value matches exactly:

print(test['1'].isin(np.linspace(0.0,1.0,11)))
0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
Name: 1, dtype: bool
Alexander Ejbekov
1

isin compares the exact values, so using it on float values is almost never a good idea. There might be floating point error that is not visible. For example,

for x in np.linspace(0.0,1.0,11): print(x)

gives you:

0.0
0.1
0.2
0.30000000000000004
0.4
0.5
0.6000000000000001
0.7000000000000001
0.8
0.9
1.0

This shows that the 0.3 you see in test is not exactly equal to the literal 0.3; the default display simply rounds it.
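To make the hidden error concrete, here is a minimal sketch that pulls out the fourth element of the same linspace array and compares it with the literal 0.3. The variable names are only illustrative.

```python
import numpy as np

values = np.linspace(0.0, 1.0, 11)

# repr() shows the full stored value instead of the shortened display value.
stored = float(values[3])
print(repr(stored))

# An exact comparison with the literal fails, although the gap is tiny.
exact_match = stored == 0.3
difference = abs(stored - 0.3)
print(exact_match, difference)
```

The difference is far below anything a measurement would resolve, but isin() only accepts an exact match.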

Quang Hoang
  • I see, thank you! - Is there a better way of comparing the entries then? In my real case I have measurements in a dataframe and want to look for the indices of the matches. – Gflaesch Apr 24 '20 at 15:35
  • often people would add a small tolerance when comparing floats. That is if `abs(a-b) < tolerance` then they are considered equal. – Quang Hoang Apr 24 '20 at 15:42
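The tolerance idea from the comment above can be sketched as follows, including how to recover the index labels of the matches. The names targets and tolerance and the value 1e-9 are assumptions for illustration; a suitable tolerance depends on the precision of the measurements.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({'1': np.linspace(0.0, 1.0, 11)})
targets = np.array([0.3, 0.6, 0.75])  # hypothetical values to look for
tolerance = 1e-9                      # assumed; choose to suit your data

values = test['1'].to_numpy()
# Entry (i, j) is True when values[i] is within tolerance of targets[j].
close = np.abs(values[:, None] - targets[None, :]) < tolerance
# A row matches if it is close to at least one target.
mask = close.any(axis=1)

matching_index = test.index[mask]
print(matching_index.tolist())  # [3, 6]
```

Here 0.75 has no counterpart in the column, so only the rows holding 0.3 and 0.6 are reported.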
1

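One way to make it work is to round the values to one decimal place first, here by formatting them as strings and converting back to float:

test['1'] = test['1'].map(lambda x: '%.1f' % x)  # format to one decimal place; values become strings
# np.float was removed in NumPy 1.24; the built-in float does the same job
print(test['1'].astype(float).isin([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]))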

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
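As a variation on the same rounding idea, the sketch below uses Series.round instead of a string round-trip and leaves the original column untouched. Rounding both the column and the lookup values to the same number of decimals is a choice made here, not part of the answer above, and it only suits data that is meaningful to a fixed number of decimals.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({'1': np.linspace(0.0, 1.0, 11)})
wanted = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

decimals = 1  # assumed precision of the data
# Round both sides the same way so that equal values get identical bits.
rounded_match = test['1'].round(decimals).isin(np.round(wanted, decimals))
print(rounded_match.all())
```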
NYC Coder
1

Use np.isclose when you want to do "equality" checks on floats. Use broadcasting to do all of the comparisons and np.logical_or.reduce to combine the results into a single mask indicating it "equals" any element.

import numpy as np
import pandas as pd

test = pd.DataFrame({'1': np.linspace(0.0, 1.1, 12)})
l = [0., 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.]
arr = np.array(l)  # So we can broadcast

test['in_l_close'] = np.logical_or.reduce(np.isclose(test['1'].to_numpy()[None, :], arr[:, None]))
test['in_l_naive'] = test['1'].isin(l)  # For comparison, to show the flaw in the exact check.

print(test)

      1  in_l_close  in_l_naive
0   0.0        True        True
1   0.1        True        True
2   0.2        True        True
3   0.3        True       False
4   0.4        True        True
5   0.5        True        True
6   0.6        True       False
7   0.7        True       False
8   0.8        True        True
9   0.9        True        True
10  1.0        True        True
11  1.1       False       False
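The broadcasting step above can be wrapped in a small helper, sketched below. The name isin_close is hypothetical and not a pandas or NumPy function; rtol and atol are passed straight to np.isclose, and the values shown are its documented defaults.

```python
import numpy as np

def isin_close(values, candidates, rtol=1e-05, atol=1e-08):
    """Hypothetical helper: True where a value is close to any candidate."""
    values = np.asarray(values, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    # Shape (len(candidates), len(values)): one row per candidate.
    close = np.isclose(values[None, :], candidates[:, None], rtol=rtol, atol=atol)
    # Collapse the candidate axis; same result as np.logical_or.reduce(close).
    return close.any(axis=0)

l = [0., 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.]
mask = isin_close(np.linspace(0.0, 1.1, 12), l)
print(mask)
```

The intermediate array holds one boolean per (candidate, value) pair, so memory grows with the product of the two lengths. For very large inputs a sort-based or chunked approach may be a better fit.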
ALollz