Maybe this was answered before, but I'm trying to understand the best way to work with subtraction in Pandas.

import pandas as pd
import random
import numpy as np

random.seed(42)
# five random floats followed by five values of exactly 0.7
data = {'r': [random.random() for i in range(5)]}
for i in range(5):
    data['r'].append(0.7)
df = pd.DataFrame(data)

If I run the following, I get the expected results:

print(np.sum(df['r'] >= 0.7))
6

However, if I modify the condition slightly, I don't get the expected result:

print(np.sum(df['r']-0.5 >= 0.2))
1

The same happens if I try to fix it by casting to float or np.float64 (and combinations of these), like the following:

print(np.sum(df['r'].astype(np.float64)-np.float64(0.5) >= np.float64(0.2)))
1

For sure I'm not doing the casting properly, but any help on this would be more than welcome!

glhuilli

2 Answers


You're not doing anything improperly. This is a totally straightforward floating point error. It will always happen.

>>> 0.7 >= 0.7
True
>>> (0.7 - 0.5) >= 0.2
False

You have to remember that floating point numbers are represented in binary, so they can only represent finite sums of powers of 2 (including negative powers) with perfect precision. Anything that can't be written as such a finite sum, like 0.7, will be subject to error like this.

You can see why by forcing Python to display the full-precision value associated with the literal 0.7:

>>> format(0.7, '.60g')
'0.6999999999999999555910790149937383830547332763671875'
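
As a side note, if you want the comparison itself to tolerate this kind of rounding, one option is to compare against an explicit tolerance. A minimal sketch (using np.isclose with its default tolerances is my own choice here, not something from the question):

shifted = df['r'] - 0.5
# count values that are >= 0.2 or indistinguishable from 0.2 up to rounding,
# so 0.7 - 0.5 == 0.1999... still matches
mask = (shifted >= 0.2) | np.isclose(shifted, 0.2)
print(mask.sum())  # 6 with the question's data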
senderle

To add to @senderle's answer: since this is a floating point issue, you can work around it with:

((df['r'] - 0.5) >= 0.19).sum()

On a slightly different note, I'm not sure why you use np.sum when you could just use pandas' own .sum; it seems like an unnecessary import.
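
For example, with the question's df (just an illustration):

print((df['r'] >= 0.7).sum())            # 6, same result as np.sum
print(((df['r'] - 0.5) >= 0.19).sum())   # 6 with the adjusted threshold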

Kenan