
I am aware of the technical limitations when comparing floats, but consider the following example:

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1.12060000],
                   'col2': [1.12065000]})
df
Out[155]: 
        col1       col2
0 1.12060000 1.12065000

As you can see, col2 and col1 are exactly 0.00005 apart. Now I want to test that. I understand that the following returns the wrong result because I am comparing decimal fractions stored as floats:

(df.col2 - df.col1) < 0.00005
Out[156]: 
0    True
dtype: bool
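
Printing the difference at full precision (a quick diagnostic sketch on the df above) confirms that the stored difference is not exactly 0.00005:

print('{:.20f}'.format((df.col2 - df.col1).iloc[0]))
# prints a value just below 0.00005, which is why the comparison above returns True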

However, more puzzling to me are the following results

(100000*df.col2 - 100000*df.col1) < 5
Out[157]: 
0    True
dtype: bool

while

(1000000*df.col2 - 1000000*df.col1) < 50
Out[158]: 
0    False
dtype: bool

Why does the comparison to 5 fail and only the last one work? I thought using integers would solve the issues when comparing floats?

Thanks!


1 Answer

Floating point precision is the issue here. These numbers seem natural in base-10, but your computer stores them in base-2, which leads to oddities such as 0.1 + 0.2 = 0.30000000000000004. In your example:

>>> 1.12060000*100000, 1.12065000*100000
(112060.0, 112064.99999999999)
>>> 1.12060000*1000000, 1.12065000*1000000
(1120600.0, 1120650.0)

This is why the first difference is less than 5 (it is roughly 4.99999999999, not exactly 5).
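
Subtracting the scaled values directly shows the same thing (a quick sketch using the literals from the question; outputs omitted, but the first result sits just below 5 while the second is exactly 50.0):

>>> 1.12065*100000 - 1.1206*100000      # a hair under 5, hence the "< 5" test is True
>>> 1.12065*1000000 - 1.1206*1000000    # exactly 50.0, hence the "< 50" test is False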

Yes, but the usual "cure" has always been to convert to integers (here by multiplying by 100000).

Ah! But you're not converting to integers! Just to larger floats! The "cure" here is to call round(float) or in the case of a DataFrame, df.round():

>>> round(1.12060000*100000), round(1.12065000*100000)
(112060, 112065)
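
The same idea applied to the DataFrame itself might look like this (a sketch assuming the df from the question: scale, round, then compare):

>>> scaled = (100000 * df).round()    # rounds the scaled values to 112060.0 and 112065.0
>>> (scaled.col2 - scaled.col1) < 5
0    False
dtype: bool

Note that df.round() leaves the columns as floats (112060.0 and 112065.0); if you really want an integer dtype you could follow it with .astype(int), but for the comparison the rounding alone is what matters.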