Floating point numbers are only accurate to a certain number of significant figures. Imagine that all of your numbers - including intermediate results - are only accurate to two significant figures, and you want the sum of the list [100, 1, 1, 1, 1, 1, 1]:
- The "true" sum is 106, but this cannot be represented since we're only allowed two significant figures;
- The "correct" answer is 110, since that's the "true" sum rounded to 2 s.f.;
- But if we naively add the numbers in sequence, we'll first do 100 + 1 = 100 (to 2 s.f.), then 100 + 1 = 100 (to 2 s.f.), and so on until the final result is 100.
The "correct" answer can be achieved by adding the numbers up from smallest to largest; 1 + 1 = 2, then 2 + 1 = 3, then 3 + 1 = 4, then 4 + 1 = 5, then 5 + 1 = 6, then 6 + 100 = 110 (to 2 s.f.). However, even this doesn't work in the general case; if there were over a hundred 1s then the intermediate sums would start being inaccurate. You can do even better by always adding the smallest two remaining numbers.
Python's built-in `sum` function uses the naive algorithm, while the `df['series'].sum()` method uses a more accurate algorithm with a lower accumulated rounding error. From the NumPy source code, which pandas uses:
> For floating point numbers the numerical precision of sum (and `np.add.reduce`) is in general limited by directly adding each number individually to the result causing rounding errors in every step. However, often numpy will use a numerically better approach (partial pairwise summation) leading to improved precision in many use-cases. This improved precision is always provided when no `axis` is given.
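To make the difference concrete, here is a small sketch comparing the built-in `sum` with NumPy's pairwise summation, using `math.fsum` as the correctly-rounded reference. The values are illustrative, not the list from the question:

```python
import math
import numpy as np

values = [0.1] * 1_000_000
reference = math.fsum(values)          # correctly-rounded sum, used as the baseline

naive = sum(values)                    # left-to-right accumulation
pairwise = np.sum(np.array(values))    # NumPy's partial pairwise summation

print(abs(naive - reference))          # typically around 1e-6: rounding error builds up
print(abs(pairwise - reference))       # typically orders of magnitude smaller
```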
The `math.fsum` function uses an algorithm which is more accurate still:

> In contrast to NumPy, Python's `math.fsum` function uses a slower but more precise approach to summation.
For your list, the result of `math.fsum` is -1.484363, which is the correctly-rounded answer.
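As a quick demonstration of the difference (this is the standard example from the `math` module documentation, not the list from the question):

```python
import math

print(sum([0.1] * 10))        # 0.9999999999999999 -- error accumulates over ten additions
print(math.fsum([0.1] * 10))  # 1.0 -- partial sums are tracked exactly and rounded once
```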