
I have a multi-array stack of data, masked to exclude 'bad' or problematic values; the stack runs along the third dimension. My current code uses np.sum, but the limited precision (with both very large and very small numbers) has negatively impacted the results. I've attempted to implement the kahan_sum referenced below, but I forgot about the masked arrays, and the results don't match (because of the masking). My hope is that the extra precision retained by a Kahan summation with a compensation accumulator will let downstream operations accumulate less error.

Source/research: https://github.com/numpy/numpy/issues/8786, plus searches for "Kahan summation" and "Python floating point precision sum". (I've already jacked the dtype precision up as far as possible, but it doesn't help.)
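
To see the kind of error I'm talking about, here's a toy float32 example (not my real data) where naive accumulation silently drops every small addend and Kahan summation recovers them:

import numpy as np

# Naive accumulation in float32: each 1.0 is below the spacing (2.0)
# of representable floats near 2**24, so every addition rounds away.
total = np.float32(2**24)
for _ in range(100):
    total += np.float32(1.0)
print(total)  # 16777216.0 -- all 100 additions vanished

# Kahan summation carries the lost low-order bits in a compensation
# term c and recovers the correct result in the same precision.
s, c = np.float32(2**24), np.float32(0.0)
for _ in range(100):
    y = np.float32(1.0) - c
    t = s + y
    c = (t - s) - y
    s = t
print(s)  # 16777316.0 -- correct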


import numpy as np
import numpy.ma as ma


def kahan_sum(a, axis=0):
    # Adapted from http://stackoverflow.com/a/42817610/353337
    # NB: axis must be an integer; the original default of axis=None
    # breaks the shape arithmetic below.
    s = np.zeros(a.shape[:axis] + a.shape[axis + 1:])
    c = np.zeros(s.shape)
    for i in range(a.shape[axis]):
        # Take the i-th slice along `axis` and apply one Kahan step:
        # y is the compensated addend, c accumulates the lost low bits.
        y = a[(slice(None),) * axis + (i,)] - c
        t = s + y
        c = (t - s) - y
        s = t.copy()
    return s


data = np.random.rand(5, 5, 5)
dd = np.ma.masked_array(data=data, mask=np.random.rand(5, 5, 5) < 0.2)

I want to sum along the third dimension (axis=2), as that's essentially my 'stack' of photos.

The masks are not coming out as I expected. It's possible I'm just overtired...

np.sum(dd, axis=2)
kahan_sum(dd, axis=2)

np.sum provides a fully populated array of data and excludes the masked values. kahan_sum effectively OR'd the masks together: any output position that is masked in any layer of the stack comes out masked, and I've been unable to work out a pattern for it.

Printing the mask makes it pretty evident that that's where the problem is; I just can't figure out how to fix it, or why it's operating the way it is.
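
One direction that might work (just a sketch, untested on the real data): fill the masked entries with 0.0 so they contribute nothing to the compensated sum, and rebuild the output mask from positions that are masked in every layer, which mirrors what np.sum does on masked arrays:

def kahan_sum_masked(a, axis=0):
    # Sketch: masked entries are zero-filled so they add nothing to the
    # compensated sum; the output is masked only where every layer
    # along `axis` is masked, mirroring np.sum on masked arrays.
    filled = np.ma.filled(a, 0.0)
    all_masked = np.all(np.ma.getmaskarray(a), axis=axis)
    s = np.zeros(filled.shape[:axis] + filled.shape[axis + 1:])
    c = np.zeros(s.shape)
    for i in range(filled.shape[axis]):
        y = filled[(slice(None),) * axis + (i,)] - c
        t = s + y
        c = (t - s) - y
        s = t
    return np.ma.masked_array(s, mask=all_masked)

kahan_sum_masked(dd, axis=2) should then line up with np.sum(dd, axis=2) in both shape and mask.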

Thank you.

J.Hirsch
  • edit: Since I was trying to create a drop-in substitute, the suggestion below hasn't worked. Currently I'm getting a float divide-by-zero (somewhere) in my huge datasets, so I'm going to investigate further. I think I'm going to have to give up on getting better precision, as NumPy just isn't giving me the flexibility to do so, and I'm not savvy enough to write exhaustive functions to handle it. Any other ideas would be welcome, and I'm grateful, Paul, for your suggestion; it gave me a good start. – J.Hirsch Jul 08 '19 at 12:22

1 Answer


If you really need more precision, consider using math.fsum, which is accurate to full floating-point resolution. If A is your 3D masked array, something like:

import math

i, j, k = A.shape
# Per (i, j) cell: drop masked entries with .compressed(), then sum exactly with fsum.
np.frompyfunc(lambda i, j: math.fsum(A[i, j].compressed().tolist()), 2, 1)(*np.ogrid[:i, :j])
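
As a quick sanity check (assuming the dd array from the question, with A = dd): the result is an object-dtype array, so cast it back to float before comparing against np.sum:

A = dd
i, j, k = A.shape
fsums = np.frompyfunc(lambda i, j: math.fsum(A[i, j].compressed().tolist()), 2, 1)(*np.ogrid[:i, :j])
fsums = fsums.astype(float)
# Positions where every layer is masked come out as fsum([]) == 0.0,
# so compare against the masked sum filled with 0.0.
print(np.allclose(fsums, np.sum(dd, axis=2).filled(0.0)))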

But before that, I'd triple-check that np.sum really isn't good enough. As far as I know it uses pairwise summation along contiguous axes, which in practice tends to be pretty good.

Paul Panzer
  • Unfortunately, np.sum just hasn't worked. It has worked, I mean, for a few hundred million individual data sets, but it's still off and I'm starting to bump up against edge cases. If that one-liner works I'll be ecstatic. I've not seen that type of approach in any improved-precision discussion anywhere. – J.Hirsch Jul 03 '19 at 21:16