1

Hope anyone can help explain the following behavior and how to fix it

I have a DataFrame like this:

l = [[1, 50, 3, 4.003], [1, 50, 4, 5.002], [2, 10, 3, 5.003], [2, 50, 2, 5.004]]
df = pd.DataFrame(l, columns=["a", "b", "c","d"])

df # Output:
    a   b   c   d
0   1   50  3   4.003
1   1   50  4   5.002
2   2   10  3   5.003
3   2   50  2   5.004

Now if I do a simple sum of column "d" row1 + row2 I should expect a result of 9.005, right? Well wrong, instead I get 9.004999999999999, this is what I do:

sum = df.loc[0]["d"] + df.loc[1]["d"]
sum # Output:
9.004999999999999

This affects the outcome of rounding to 2 decimal points after the sum.

I did some research on the internet and it might have to do with number of bytes used on the float64 data type but is there a way to overcome this issue?

By the way when I try the same with regular python variables it does the same thing:

d0 = 4.003
d1 = 5.002

d0 + d1 # Output:
9.004999999999999

The challenge is that if I need to round the results to two decimals I want to get 9.01 and not 9.0

round((d0 + d1), 2) # Output:
9.0  # It should have been 9.01

So when we have thousands or even millions of records one would have to identify which values are having the point-precision error before summing and rounding to avoid this issue?, it will be very impractical, any other suggestions?

Code Ninja 2C4U
  • 114
  • 1
  • 11
  • This is floating point precision error. You can add a small noise to get the desired output: `round( d0 + d1 + 1e-9, 2)`. – Quang Hoang Aug 31 '20 at 18:09
  • Also, might be relevant to [this question](https://stackoverflow.com/questions/10825926/python-3-x-rounding-behavior#:~:text=Python%203%27s%20way%20%28called%20%22%20round%20half%20to,large%20numbers%20of%20calculations%2C%20this%20can%20be%20significant.). – Quang Hoang Aug 31 '20 at 18:11
  • Thanks @QuangHoang, and how can I translate that into DataFrame sum? – Code Ninja 2C4U Aug 31 '20 at 18:11
  • `df['d'].rolling(2).sum()` or `df['d'] + df['d'].shift()`. – Quang Hoang Aug 31 '20 at 18:12
  • Appreciate your help but I am not sure that is the expected output when using `rolling` function, my real dataframe has thousands of records in column "d" that need to be summed (grouped by), I do notice from time to time that exact same input throws a .01 or even .02 after rounding and the reason is this floating point precision error, I guess I would have to identify which values have the error first and then sum. – Code Ninja 2C4U Aug 31 '20 at 18:28
  • The easiest way I think is to round to 3 decimals then use str format to shorten it to 2 decimals and covert that str back to float. So df["sum"]=df["d"].rolling(2).sum() df["sum"]=df["sum"].round(3).map("{:.2f}".format) – Gingerhaze Aug 31 '20 at 19:22

1 Answers1

0

Regarding your second example: you would need to use the Decimal package to avoid the round-off error and obtain your expected results.

from decimal import Decimal

d0 = Decimal('4.003')
d1 = Decimal('5.002')
print(d0 + d1)
print(round(d0 + d1, 2))

9.005
9.00

This paper might be helpful: https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf

The Kahan summation algorithm (to minimize round-off error) might also be relevant: https://en.wikipedia.org/wiki/Kahan_summation_algorithm

jsmart
  • 2,921
  • 1
  • 6
  • 13