How can I run a numpy function percentile() on a masked array?

Question

I am trying to retrieve percentiles from an array with NoData values. In my case the NoData values are represented by -3.40282347e+38. I thought a masked array would exclude these values from further calculations. I successfully create the masked array, but the mask has no effect on the np.percentile() function.

>>> DataArray = np.array(data)
>>> DataArray

([[ value, value...]], dtype=float32)

>>> masked_data = np.ma.masked_where(DataArray < 0, DataArray)
>>> p5 = np.percentile(masked_data, 5)
>>> print(p5)

 -3.40282347e+38

Best to use the masked array methods or the np.ma functions. Many np functions delegate to the methods, but don't count on it. – hpaulj, Jun 21 '16 at 05:11

score 13 · Answer 1 · answered Jan 16 '17 at 09:41


If you fill your masked values with np.nan, you can then use np.nanpercentile, which ignores NaN entries:

import numpy as np
data = np.arange(-5.5,10.5) # Note that you need a non-integer array to store NaN
mdata = np.ma.masked_where(data < 0, data)
mdata = np.ma.filled(mdata, np.nan)
np.nanpercentile(mdata, 50) # 50th percentile
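
The same idea carries over to per-axis percentiles. The following is only a minimal sketch: the 2-D `raster` array and its values are made up for illustration, the NoData sentinel is the one quoted in the question, and the mask condition (`< 0`) is the one the question uses.

```python
import numpy as np

# Hypothetical 2-D float32 raster; NODATA is the sentinel quoted in the question.
NODATA = -3.40282347e+38
raster = np.array([[1.0, 2.0, NODATA, 4.0],
                   [NODATA, 10.0, 20.0, 30.0]], dtype=np.float32)

# Mask the NoData cells (same condition as in the question),
# then replace the masked cells with NaN. This needs a float dtype.
masked = np.ma.masked_where(raster < 0, raster)
filled = np.ma.filled(masked, np.nan)

# 50th percentile of each row, ignoring the NaN cells.
row_medians = np.nanpercentile(filled, 50, axis=1)
print(row_medians)
```

Each row is reduced using only its valid cells, so the rows may contain different numbers of NoData values.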


alphabetasoup


This is certainly a convenient solution (e.g., it allows applying the percentiles over a particular `axis`, whereas simply calling `mdata.compressed()` does not), but I'm concerned that it's expensive. – Paul Price Jun 05 '20 at 19:50

score 10 · Accepted Answer · answered Jun 21 '16 at 05:54

Looking at the np.percentile code, it is clear that it does nothing special with masked arrays.

def percentile(a, q, axis=None, out=None,
               overwrite_input=False, interpolation='linear', keepdims=False):
    q = array(q, dtype=np.float64, copy=True)
    r, k = _ureduce(a, func=_percentile, q=q, axis=axis, out=out,
                    overwrite_input=overwrite_input,
                    interpolation=interpolation)
    if keepdims:
        if q.ndim == 0:
            return r.reshape(k)
        else:
            return r.reshape([len(q)] + k)
    else:
        return r

Here _ureduce and _percentile are internal functions defined in numpy/lib/function_base.py, so the real action is more complex than this wrapper suggests.

Masked arrays have two strategies for using numpy functions. One is to fill: replace the masked values with innocuous ones, for example 0 when doing a sum or 1 when doing a product. The other is to compress the data, that is, remove all masked values.

For example:

In [997]: data=np.arange(-5,10)
In [998]: mdata=np.ma.masked_where(data<0,data)

In [1001]: np.ma.filled(mdata,0)
Out[1001]: array([0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [1002]: np.ma.filled(mdata,1)
Out[1002]: array([1, 1, 1, 1, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [1008]: mdata.compressed()
Out[1008]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Which is going to give you the desired percentile: filling or compressing? Or neither? You need to understand the concept of a percentile well enough to know how it should apply in the case of your masked values.
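
To make the difference concrete, here is a small sketch that reuses the `data` and `mdata` arrays from the session above and computes the 50th percentile both ways. The variable names `p50_filled` and `p50_compressed` are only illustrative.

```python
import numpy as np

data = np.arange(-5, 10)
mdata = np.ma.masked_where(data < 0, data)

# Filling keeps all 15 elements; the five fill values still take part in the ranking.
p50_filled = np.percentile(np.ma.filled(mdata, 0), 50)

# Compressing drops the masked entries, leaving only the 10 valid values 0..9.
p50_compressed = np.percentile(mdata.compressed(), 50)

print(p50_filled)      # 2.0
print(p50_compressed)  # 4.5
```

Filling pulls the result toward the fill value, while compressing computes the percentile over the valid values only. Note that compressed() always returns a flattened 1-D array, so it does not help with per-axis percentiles.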

compressed() did the trick for me, since I needed to fully exclude the NoData values before the percentile calculation. – EikeMike, Jun 21 '16 at 06:34
