
I am aware of the technical limitations when comparing floats, but consider the following example:

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1.12060000],
                   'col2': [1.12065000]})
df
Out[155]: 
        col1       col2
0 1.12060000 1.12065000

As you can see, col2 and col1 are exactly 0.00005 apart. Now I want to test that. I understand that the following returns the wrong result because I am comparing decimal fractions stored as floats:

(df.col2 - df.col1) < 0.00005
Out[156]: 
0    True
dtype: bool
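
Printing the difference at full precision (a quick diagnostic sketch on the df above) confirms that the stored difference is not exactly 0.00005:

print('{:.20f}'.format((df.col2 - df.col1).iloc[0]))
# prints a value just below 0.00005, which is why the comparison above returns True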

However, more puzzling to me are the following results

(100000*df.col2 - 100000*df.col1) < 5
Out[157]: 
0    True
dtype: bool

while

(1000000*df.col2 - 1000000*df.col1) < 50
Out[158]: 
0    False
dtype: bool

Why does the comparison to 5 fail and only the last one work? I thought using integers would solve the issues when comparing floats?

Thanks!


1 Answer

Floating point precision is the issue here. These numbers seem natural in base-10, but your computer stores them in base-2, which leads to oddities such as 0.1 + 0.2 = 0.30000000000000004. In your example:

>>> 1.12060000*100000, 1.12065000*100000
(112060.0, 112064.99999999999)
>>> 1.12060000*1000000, 1.12065000*1000000
(1120600.0, 1120650.0)

This is why the first difference is less than 5 (it is roughly 4.99999999999, not exactly 5).
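
Subtracting the scaled values directly shows the same thing (a quick sketch using the literals from the question; outputs omitted, but the first result sits just below 5 while the second is exactly 50.0):

>>> 1.12065*100000 - 1.1206*100000      # a hair under 5, hence the "< 5" test is True
>>> 1.12065*1000000 - 1.1206*1000000    # exactly 50.0, hence the "< 50" test is False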

Yes, but the usual "cure" has always been to convert to integers (here by multiplying by 100000).

Ah! But you're not converting to integers! Just to larger floats! The "cure" here is to call round(float) or in the case of a DataFrame, df.round():

>>> round(1.12060000*100000), round(1.12065000*100000)
(112060, 112065)
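
The same idea applied to the DataFrame itself might look like this (a sketch assuming the df from the question: scale, round, then compare):

>>> scaled = (100000 * df).round()    # rounds the scaled values to 112060.0 and 112065.0
>>> (scaled.col2 - scaled.col1) < 5
0    False
dtype: bool

Note that df.round() leaves the columns as floats (112060.0 and 112065.0); if you really want an integer dtype you could follow it with .astype(int), but for the comparison the rounding alone is what matters.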