
I have a satellite image which is quite noisy. I wanted to try a Water Anomaly Filter proposed here. It is straightforward and appears to be suited to my use case, as it assumes that the "real" structures on the ground are much larger than the noise, which appears only in isolated pixels. The proposed algorithm works as follows:

  1. let a 5x5 pixel kernel run over the input image
  2. calculate the mean and standard deviation (SD) of each kernel subset, but exclude the center pixel
  3. remove outliers (i.e. values outside mean +/- 1 SD) and repeat step 2 on the remaining values to obtain the mean and SD without outliers (mean_wo and SD_wo)
  4. if the value of the pixel in the center of the kernel is outside mean_wo +/- 1 SD_wo, replace it by mean_wo; otherwise keep the original value.

This may sound complicated, but it keeps as many original data values as possible, which is crucial for me.
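To make the steps concrete, a single window could be processed like this in plain NumPy (a minimal sketch; waf_center is just an illustrative name, and I use the population SD, which may differ from the IDL original):

import numpy as np

def waf_center(window):
    # steps 2-4 of the filter for one k x k window; returns the filtered center value
    k = window.shape[0]
    center = window[k // 2, k // 2]
    # step 2: statistics of the neighbourhood, excluding the center pixel
    neigh = np.delete(window.ravel(), (k * k) // 2)
    mean, sd = neigh.mean(), neigh.std()
    # step 3: drop outliers and recompute the statistics
    inliers = neigh[np.abs(neigh - mean) <= sd]
    mean_wo, sd_wo = inliers.mean(), inliers.std()
    # step 4: replace the center only if it lies outside mean_wo +/- 1 SD_wo
    return center if abs(center - mean_wo) <= sd_wo else mean_wo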

The original algorithm is implemented in IDL, but I need it in Python. My data are in netCDF files and I usually work with xarray. I already looked into xarray.DataArray.rolling and xarray.apply_ufunc, but I didn't find a way to implement this algorithm with its kernel and if-statement. I'm also open to solutions with other packages, but so far I have had no success implementing an if-condition inside a kernel with numpy or scipy either.

Of course, I could implement the algorithm with brute force and let the kernel run over the image with two for loops and so on, but since I have a few satellite images, each with thousands × thousands of pixels and several bands, I would like to find a faster solution.
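One loop-free idea is xarray's rolling(...).construct(), which exposes every kernel as two extra window dimensions, so the statistics become plain reductions and the if-statement becomes xr.where(). This is only a sketch under my own assumptions (dims named 'y' and 'x' as in the example below, population SD via the default ddof=0, function name waf_xr made up), and the where() calls materialise copies, so memory grows with the kernel area:

import numpy as np
import xarray as xr

def waf_xr(da, size=3):
    # expose every kernel as two extra dims; border windows are padded with NaN
    win = da.rolling(y=size, x=size, center=True).construct(y="wy", x="wx")
    c = size // 2
    wy = xr.DataArray(np.arange(size), dims="wy")
    wx = xr.DataArray(np.arange(size), dims="wx")
    # step 2: mask the center pixel out of the neighbourhood statistics
    neigh = win.where((wy != c) | (wx != c))
    mean = neigh.mean(["wy", "wx"])
    sd = neigh.std(["wy", "wx"])
    # step 3: second pass without the outliers
    inliers = neigh.where(abs(neigh - mean) <= sd)
    mean_wo = inliers.mean(["wy", "wx"])
    sd_wo = inliers.std(["wy", "wx"])
    # step 4: keep the original value unless it is an outlier itself
    out = xr.where(abs(da - mean_wo) <= sd_wo, da, mean_wo)
    # NaN out incomplete border windows, as in the rolling-mean example below
    return out.where(win.notnull().all(["wy", "wx"]))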

import xarray as xr
import numpy as np
np.random.seed(42)

# define sample data
data = np.random.rand(25).reshape(5, 5)
data[1,1] = data[3,3] = 5
xds = xr.DataArray(data, dims=['y', 'x'], name='raster')
print(xds)

#<xarray.DataArray 'raster' (y: 5, x: 5)>
#array([[0.37454012, 0.95071431, 0.73199394, 0.59865848, 0.15601864],
#       [0.15599452, 5.        , 0.86617615, 0.60111501, 0.70807258],
#       [0.02058449, 0.96990985, 0.83244264, 0.21233911, 0.18182497],
#       [0.18340451, 0.30424224, 0.52475643, 5.        , 0.29122914],
#       [0.61185289, 0.13949386, 0.29214465, 0.36636184, 0.45606998]])
#Dimensions without coordinates: y, x

# apply rolling to compute a mean filter; for the sake of simplicity only with a 3x3 kernel size
print(xds.rolling({"x": 3, "y": 3}, center=True).mean())

#<xarray.DataArray 'raster' (y: 5, x: 5)>
#array([[       nan,        nan,        nan,        nan,        nan],
#       [       nan, 1.10026178, 1.19592772, 0.54318239,        nan],
#       [       nan, 0.98416787, 1.59010905, 1.02421734,        nan],
#       [       nan, 0.43098129, 0.96018785, 0.90635209,        nan],
#       [       nan,        nan,        nan,        nan,        nan]])
#Dimensions without coordinates: y, x

# However, I wish to implement the filter described above, which should yield (again with a kernel size of 3x3):
#<xarray.DataArray 'raster' (y: 5, x: 5)>
#array([[       nan,        nan,        nan,        nan,        nan],
#       [       nan, 0.61279450, 0.86617615, 0.60111501,        nan],
#       [       nan, 0.96990985, 0.83244264, 0.21233911,        nan],
#       [       nan, 0.30424224, 0.52475643, 0.39464610,        nan],
#       [       nan,        nan,        nan,        nan,        nan]])
#Dimensions without coordinates: y, x

  • There are numerous posts about how to do that. If you want to use basic vectorized functions and the window is not very small, then you can use a [cumsum strategy](https://stackoverflow.com/questions/70368702/rolling-statistics-performance-pandas-vs-numpy-strides/70369323#70369323). Otherwise, you can [use Numba](https://stackoverflow.com/questions/67618680/rolling-mean-of-a-sorted-subarray-of-a-dataframe/67629589#67629589) to do that efficiently (see `col_k_mean(array, 3, 0)`, though the code may need to be modified to change the behaviour for the borders). – Jérôme Richard Dec 22 '22 at 18:01
  • To add to Jerome's tips: if the filter kernel is symmetric, you can use this fact in your algorithm to get a ~2-4x speed-up – dankal444 Dec 27 '22 at 09:21
  • Thank you @JérômeRichard, I now use Numba as in the example together with Dask, which is a huge improvement in performance, even though it still takes some hours for my huge image stack (5 bands with ~ 40.000 x 50.000 pixels). – eimes Jan 03 '23 at 10:43
  • @dankal444: Can you explain what you mean with a symmetric filter kernel and how I can use that to speed up the algorithm? – eimes Jan 03 '23 at 10:45
  • AFAIK, symmetric kernel is another name for a [separable filter](https://en.wikipedia.org/wiki/Separable_filter). I am not sure about the x2/x4 for this simple kernel, but it is definitely something to try here. Besides, hours of computation is quite a lot for such an input, unless 1. the data are stored on a slow storage device or 2. the computation is significantly more expensive than the one exposed here. Your program is likely memory-bound, so the computation should take clearly no more than a minute for five 40.000x50.000x3 images. – Jérôme Richard Jan 03 '23 at 14:09
  • @eimes as Jerome said, take a look at his link; there are examples that show that **sometimes** (if the kernel is symmetric) instead of a single NxN filter you can use two Nx1 filters. Performance gains are greater for bigger kernels, but even for 3x3 they are significant. – dankal444 Jan 04 '23 at 10:00
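
For reference, the Numba route from the comments could look roughly like this (a sketch only, not the code from the linked answers; waf_numba is a made-up name, borders are left as NaN, and the population SD is assumed):

import numba as nb
import numpy as np

@nb.njit(parallel=True, cache=True)
def waf_numba(img, size=3):
    c = size // 2
    out = np.full(img.shape, np.nan)
    # run the kernel over all interior pixels, rows in parallel
    for i in nb.prange(c, img.shape[0] - c):
        for j in range(c, img.shape[1] - c):
            win = img[i - c:i + c + 1, j - c:j + c + 1].copy().ravel()
            # exclude the center pixel from the neighbourhood
            m = (size * size) // 2
            neigh = np.concatenate((win[:m], win[m + 1:]))
            mean, sd = neigh.mean(), neigh.std()
            # drop outliers and recompute the statistics
            inliers = neigh[np.abs(neigh - mean) <= sd]
            mean_wo, sd_wo = inliers.mean(), inliers.std()
            # replace the center only if it is an outlier itself
            out[i, j] = img[i, j] if abs(img[i, j] - mean_wo) <= sd_wo else mean_wo
    return out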

0 Answers