
I'm summing up the values in a series, but depending on how I do it, I get different results. The two ways I've tried are:

sum(df['series'])

df['series'].sum()

Why would they return different values?

Sample code:

import pandas as pd

s = pd.Series([
    0.428229,
    -0.948957,
    -0.110125,
    0.791305,
    0.113980,
    -0.479462,
    -0.623440,
    -0.610920,
    -0.135165,
    0.090192,
])

print(s.sum())  # pandas method
print(sum(s))   # Python's built-in sum

Output:

-1.4843630000000003
-1.4843629999999999

The difference is quite small here, but in a dataset with a few thousand values, it becomes quite large.

martineau
wilson_smyth
    Please provide a working example that shows that this is behaving differently. – deets Dec 01 '19 at 17:51
  • Float representation in binary is tricky. I would go with ```sum(s*10**10)/10**10```. – accdias Dec 01 '19 at 18:06
  • BTW, ```(s*10**10).sum()/10**10 == sum(s*10**10)/10**10``` is ```True```. – accdias Dec 01 '19 at 18:13
  • For more precision, use [math.fsum](https://www.quora.com/What-is-the-difference-between-sum-and-fsum-in-Python), which tracks intermediate results during summation to minimize precision loss. math.fsum(s) = -1.484363 – DarrylG Dec 01 '19 at 18:20
  • Floating point arithmetic on a computer is inherently imprecise. Suggest you check to see if the two sums are approximately equal rather than exactly so. See question [Is floating point math broken?](https://stackoverflow.com/questions/588004/is-floating-point-math-broken) and this [appendix](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) to an Oracle Numerical Computation Guide titled _What Every Computer Scientist Should Know About Floating-Point Arithmetic_. – martineau Dec 01 '19 at 18:32
  • For doing arbitrary-precision floating point arithmetic, I suggest using [mpmath](http://mpmath.org/), which you can (also) get from [pypi](https://pypi.org/project/mpmath/). – martineau Dec 01 '19 at 18:40
  • btw Pandas is built on Numpy. – Geeocode Dec 01 '19 at 18:42

1 Answer


Floating point numbers are only accurate to a certain number of significant figures. Imagine if all of your numbers - including intermediate results - are only accurate to two significant figures, and you want the sum of the list [100, 1, 1, 1, 1, 1, 1].

  • The "true" sum is 106, but this cannot be represented since we're only allowed two significant figures;
  • The "correct" answer is 110, since that's the "true" sum rounded to 2 s.f.;
  • But if we naively add the numbers in sequence, we'll first do 100 + 1 = 100 (to 2 s.f.), then 100 + 1 = 100 (to 2 s.f.), and so on until the final result is 100.

The "correct" answer can be achieved by adding the numbers up from smallest to largest; 1 + 1 = 2, then 2 + 1 = 3, then 3 + 1 = 4, then 4 + 1 = 5, then 5 + 1 = 6, then 6 + 100 = 110 (to 2 s.f.). However, even this doesn't work in the general case; if there were over a hundred 1s then the intermediate sums would start being inaccurate. You can do even better by always adding the smallest two remaining numbers.
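The two-significant-figure thought experiment above can be reproduced with Python's decimal module. This is only a sketch: the names TWO_SF, naive_sum and smallest_two_sum are made up for illustration, and the 2-digit context stands in for the much larger (but still finite) precision of real floats.

```python
from decimal import Context, Decimal
import heapq

# Arithmetic context that rounds every result to 2 significant figures
# (illustrative stand-in for the limited precision of binary floats).
TWO_SF = Context(prec=2)

def naive_sum(values):
    """Add the values left to right, rounding after every addition."""
    total = Decimal(0)
    for v in values:
        total = TWO_SF.add(total, Decimal(v))
    return total

def smallest_two_sum(values):
    """Repeatedly add the two smallest remaining numbers (non-empty input)."""
    heap = [Decimal(v) for v in values]
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        heapq.heappush(heap, TWO_SF.add(a, b))
    return heap[0]

values = [100, 1, 1, 1, 1, 1, 1]
print(int(naive_sum(values)))          # 100: every "+ 1" is rounded away
print(int(naive_sum(sorted(values))))  # 110: small numbers are added first
print(int(smallest_two_sum(values)))   # 110
```

The same list gives a different total depending only on the order of the additions, which is exactly what happens between the two ways of summing in the question.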

Python's built-in sum function uses the naive algorithm, adding the elements one at a time from left to right, while the df['series'].sum() method uses a more accurate algorithm with a lower accumulated rounding error. From the NumPy source code, which pandas uses:

For floating point numbers the numerical precision of sum (and np.add.reduce) is in general limited by directly adding each number individually to the result causing rounding errors in every step. However, often numpy will use a numerically better approach (partial pairwise summation) leading to improved precision in many use-cases. This improved precision is always provided when no axis is given.
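To give a feel for what "pairwise summation" means, here is a simplified pure-Python sketch. It is not NumPy's actual implementation (that lives in C and differs in its details); pairwise_sum, naive_sum and the block_size value are names and choices I made up for illustration.

```python
import math

def naive_sum(values):
    """Left-to-right accumulation: one rounding error per addition."""
    total = 0.0
    for v in values:
        total += v
    return total

def pairwise_sum(values, block_size=8):
    """Split the list in half, sum each half recursively, add the two results.

    Short blocks are summed naively. Partial sums of similar magnitude are
    added together, so rounding errors tend to grow far more slowly than
    with a single running total.
    """
    n = len(values)
    if n <= block_size:
        return naive_sum(values)
    mid = n // 2
    return (pairwise_sum(values[:mid], block_size)
            + pairwise_sum(values[mid:], block_size))

data = [0.1] * 100_000
reference = math.fsum(data)  # correctly rounded sum, used as the baseline
print("naive error:   ", abs(naive_sum(data) - reference))
print("pairwise error:", abs(pairwise_sum(data) - reference))
```

On typical inputs the pairwise error should come out noticeably smaller than the naive one, though neither approach guarantees a correctly rounded result.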

The math.fsum function uses an algorithm which is more accurate still:

In contrast to NumPy, Python's math.fsum function uses a slower but more precise approach to summation.

For your list, the result of math.fsum is -1.484363, which is the correctly-rounded answer.
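A minimal way to see this with only the standard library, using the values from the question as a plain list. The left-to-right loop is written out explicitly rather than calling sum, because as far as I know newer Python versions (3.12 and later) compensate for rounding inside the built-in sum when it is given plain Python floats, so sum on a list may not show the naive behaviour there.

```python
import math

values = [
    0.428229, -0.948957, -0.110125, 0.791305, 0.113980,
    -0.479462, -0.623440, -0.610920, -0.135165, 0.090192,
]

# Naive accumulation, one rounding step per addition.
running = 0.0
for v in values:
    running += v

# math.fsum tracks the lost low-order bits and rounds only once at the end.
exact = math.fsum(values)

print(running)
print(exact)

# Compare with a tolerance instead of == when checking float sums.
print(math.isclose(running, exact, rel_tol=1e-9))
```

If you only need to check that two ways of summing agree, math.isclose (or numpy.isclose on arrays) is usually the right comparison; if you need the most accurate total, use math.fsum and accept that it is slower.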

kaya3
  • Great post. In addition, there is a [test of the accuracy of various methods here](https://pypi.org/project/accupy/) covering Python, NumPy, [Kahan summation](https://en.wikipedia.org/wiki/Kahan_summation_algorithm), math.fsum, etc. The conclusion is that math.fsum is the most accurate but slowest method. – DarrylG Dec 01 '19 at 19:30