
I am trying to compute percentiles from an array that contains NoData values. In my case the NoData values are represented by -3.40282347e+38. I thought a masked array would exclude these values from further calculations. I successfully create the masked array, but the mask has no effect on the np.percentile() function.

>>> import numpy as np
>>> import numpy.ma as ma
>>> DataArray = np.array(data)
>>> DataArray

([[ value, value...]], dtype=float32)

>>> masked_data = ma.masked_where(DataArray < 0, DataArray)
>>> p5 = np.percentile(masked_data, 5)
>>> print(p5)

 -3.40282347e+38
EikeMike
    It is best to use the masked array's own methods or the np.ma functions. Many np functions delegate to those methods, but don't count on it. – hpaulj Jun 21 '16 at 05:11

2 Answers


If you fill the masked values with np.nan, you can then use np.nanpercentile, which ignores NaN entries:

import numpy as np
data = np.arange(-5.5,10.5) # Note that you need a non-integer array to store NaN
mdata = np.ma.masked_where(data < 0, data)
mdata = np.ma.filled(mdata, np.nan)
np.nanpercentile(mdata, 50) # 50th percentile
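
The same approach also works along an axis. The following is a minimal sketch, assuming a small hypothetical float32 raster (`raster`) that uses the NoData value from the question as its sentinel; the array contents and variable names are made up for illustration.

```python
import numpy as np

# Assumed NoData sentinel, taken from the question
NODATA = np.float32(-3.40282347e+38)

# Hypothetical 2 x 4 raster with one NoData cell per row
raster = np.array([[1.0, 2.0, NODATA, 4.0],
                   [NODATA, 10.0, 20.0, 30.0]], dtype=np.float32)

# Mask the sentinel, then replace the masked cells with NaN
masked = np.ma.masked_equal(raster, NODATA)
filled = masked.filled(np.nan)

# nanpercentile skips the NaN cells, per row or over the whole array
p50_rows = np.nanpercentile(filled, 50, axis=1)
p50_all = np.nanpercentile(filled, 50)

print(p50_rows)  # medians of [1, 2, 4] and [10, 20, 30]
print(p50_all)   # median of the six valid cells
```

Note that filling with NaN requires a floating-point array, which a float32 raster already is.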
alphabetasoup
    This is certainly a convenient solution (e.g., it allows applying the percentiles over a particular `axis`, whereas simply calling `mdata.compressed()` does not), but I'm concerned that it's expensive. – Paul Price Jun 05 '20 at 19:50

Looking at the np.percentile code, it is clear that it does nothing special with masked arrays.

def percentile(a, q, axis=None, out=None,
               overwrite_input=False, interpolation='linear', keepdims=False):
    q = array(q, dtype=np.float64, copy=True)
    r, k = _ureduce(a, func=_percentile, q=q, axis=axis, out=out,
                    overwrite_input=overwrite_input,
                    interpolation=interpolation)
    if keepdims:
        if q.ndim == 0:
            return r.reshape(k)
        else:
            return r.reshape([len(q)] + k)
    else:
        return r

Here _ureduce and _percentile are internal functions defined in numpy/lib/function_base.py, so the real work is more complex than this wrapper suggests.

Masked arrays have two strategies for using numpy functions. One is to fill: replace the masked values with innocuous ones, for example 0 when computing a sum or 1 when computing a product. The other is to compress the data, that is, remove all masked values.

For example:

In [997]: data=np.arange(-5,10)
In [998]: mdata=np.ma.masked_where(data<0,data)

In [1001]: np.ma.filled(mdata,0)
Out[1001]: array([0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [1002]: np.ma.filled(mdata,1)
Out[1002]: array([1, 1, 1, 1, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [1008]: mdata.compressed()
Out[1008]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Which is going to give you the desired percentile: filling, compressing, or neither? You need to understand the concept of a percentile well enough to know how it should apply to your masked values.
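
To make the difference concrete, here is a small sketch that reuses `data` and `mdata` from the example above and computes the 50th percentile both ways. The names `p50_compressed` and `p50_filled` are only illustrative.

```python
import numpy as np

data = np.arange(-5, 10)
mdata = np.ma.masked_where(data < 0, data)

# Compress: only the 10 unmasked values (0..9) take part
p50_compressed = np.percentile(mdata.compressed(), 50)

# Fill: the 5 masked values are counted as 0 and pull the result down
p50_filled = np.percentile(np.ma.filled(mdata, 0), 50)

print(p50_compressed)  # 4.5
print(p50_filled)      # 2.0
```

Compressing excludes the masked entries entirely, while any fill value still takes part in the ranking and shifts the percentile.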

hpaulj
  • compressed() did the trick for me, since I needed to fully exclude the NoData values before calculating the percentile. – EikeMike Jun 21 '16 at 06:34