
I use Python and have an array with the values 1.0, 2.0, 3.0, 4.0, 5.0, 6.0 and np.nan as NoData.

I want to fill all "nan" with a value. This value should be the majority of the surrounding values.

For example:

1 1 1 1 1
1 n 1 2 2
1 3 3 2 1
1 3 2 3 1

"n" represents "nan" in this example. The majority of its neighbors have the value 1, so "nan" should be replaced by the value 1.

Note that the holes consisting of "nan" can have a size of 1 to 5 cells. For example (maximum size of 5 nan):

1 1 1 1 1
1 n n n 2
1 n n 2 1
1 3 2 3 1

Here the hole of "nan" has the following surrounding values:

surrounding_values = [1,1,1,1,1,2,1,2,3,2,3,1,1,1] -> Majority = 1
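The list above includes diagonal neighbors. As a minimal sketch (assuming SciPy is available and that "surrounding" means the 8-connected border of the hole), the same values can be collected by dilating the hole mask by one pixel and removing the hole itself:

```python
import numpy as np
from collections import Counter
from scipy.ndimage import binary_dilation

n = np.nan
grid = np.array([[1, 1, 1, 1, 1],
                 [1, n, n, n, 2],
                 [1, n, n, 2, 1],
                 [1, 3, 2, 3, 1]], dtype=float)

hole = np.isnan(grid)
# 3x3 structuring element so that diagonal neighbors count as "surrounding"
ring = binary_dilation(hole, structure=np.ones((3, 3), dtype=bool)) & ~hole
surrounding_values = grid[ring]

# Most frequent value on the border of the hole
majority = Counter(surrounding_values.tolist()).most_common(1)[0][0]
print(len(surrounding_values), majority)
```

This yields the same 14 values as the list above (in row-major rather than clockwise order), with 1 as the majority.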

I tried the following code:

import numpy as np
from sklearn.preprocessing import Imputer

array = np.array(.......)   #consisting of 1.0-6.0 & np.nan
imp = Imputer(strategy="most_frequent")
fill = imp.fit_transform(array)

This works pretty well. However, it only uses one axis (0 = column, 1 = row). The default is 0 (column), so each "nan" is filled with the most frequent value of its own column. For example:

Array
2 1 2 1 1
2 n 2 2 2
2 1 2 2 1
1 3 2 3 1

Filled Array
2 1 2 1 1
2 1 2 2 2
2 1 2 2 1
1 3 2 3 1

So here you see that, although the majority of the surrounding values is 2, the majority of the values in the same column is 1, and thus the "nan" becomes 1 instead of 2.

As a result, I need to find another method using Python. Any suggestions or ideas?
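The difference can be checked by hand with plain NumPy. This is only an illustrative sketch; `most_frequent` is a hypothetical helper, not part of scikit-learn:

```python
import numpy as np
from collections import Counter

n = np.nan
array = np.array([[2, 1, 2, 1, 1],
                  [2, n, 2, 2, 2],
                  [2, 1, 2, 2, 1],
                  [1, 3, 2, 3, 1]], dtype=float)

def most_frequent(values):
    """Most common non-NaN value in a 1-D array (hypothetical helper)."""
    values = values[~np.isnan(values)]
    return Counter(values.tolist()).most_common(1)[0][0]

i, j = 1, 1  # position of the NaN
column_mode = most_frequent(array[:, j])                       # what a column-wise strategy sees
window_mode = most_frequent(array[i-1:i+2, j-1:j+2].ravel())   # the 3x3 neighborhood

print(column_mode, window_mode)
```

The column gives 1, while the 3x3 window around the "nan" gives 2.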


SUPPLEMENT:

Here you see the result after I added the very helpful improvement by Martin Valgur.

[Image: result after applying Martin Valgur's code]

Think of "0" as sea (blue) and of the other values (> 0) as land (red).

If there is a "little" sea surrounded by land (the sea can again have a size of 1-5 px), it becomes land, as you can see in the result image. If the enclosed sea is bigger than 5 px, or lies outside the land, it does not become land (this is not visible in the image, because the case does not occur there).

If there is a 1 px "nan" with more sea than land around it, it will still become land (in this example the split is 50/50).

The following picture shows what I need. At the border between sea (value = 0) and land (value > 0), the "nan" pixel needs to get the majority value of the land values.

[Image: desired result at the border between sea and land]

That sounds difficult, and I hope I have explained it clearly.

3 Answers

3

A possible solution using label() and binary_dilation() from scipy.ndimage:

import numpy as np
from scipy.ndimage import label, binary_dilation
from collections import Counter

def impute(arr):
    imputed_array = np.copy(arr)

    # Label each connected group of NaNs as a separate hole
    mask = np.isnan(arr)
    labels, count = label(mask)
    for idx in range(1, count + 1):
        hole = labels == idx
        # Dilating the hole by one pixel and removing the hole itself leaves its border
        surrounding_values = arr[binary_dilation(hole) & ~hole]
        most_frequent = Counter(surrounding_values).most_common(1)[0][0]
        imputed_array[hole] = most_frequent

    return imputed_array
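One detail worth noting: when no structure is passed, label() and binary_dilation() both use a cross-shaped structuring element, so only the four direct neighbors are counted. The surrounding values listed in the question include diagonals. The sketch below is a hypothetical variant (the name `impute_8` is made up for illustration) that passes a 3x3 structure to both calls:

```python
import numpy as np
from collections import Counter
from scipy.ndimage import label, binary_dilation

def impute_8(arr):
    """Hypothetical variant of impute() that also counts diagonal neighbors."""
    structure = np.ones((3, 3), dtype=bool)  # 8-connectivity
    imputed = np.copy(arr)
    # Use the same structure for labelling so diagonally touching NaNs form one hole
    labels, count = label(np.isnan(arr), structure=structure)
    for idx in range(1, count + 1):
        hole = labels == idx
        ring = binary_dilation(hole, structure=structure) & ~hole
        imputed[hole] = Counter(arr[ring].tolist()).most_common(1)[0][0]
    return imputed

n = np.nan
array = np.array([[2, 1, 2, 1, 1],
                  [2, n, 2, 2, 2],
                  [2, 1, 2, 2, 1],
                  [1, 3, 2, 3, 1]], dtype=float)
print(impute_8(array)[1, 1])
```

For the array from the question this fills the "nan" with 2. With only the four direct neighbors (1, 2, 2, 1) the count is tied, so the result depends on how the tie is broken.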

EDIT: Regarding your loosely-related follow-up question, you can extend the above code to achieve what you are after:

import numpy as np
from scipy.ndimage import label, binary_dilation, binary_closing

def fill_land(arr):
    output = np.copy(arr)

    # Fill NaN-s
    mask = np.isnan(arr)
    labels, count = label(mask)
    for idx in range(1, count + 1):
        hole = labels == idx
        surrounding_values = arr[binary_dilation(hole) & ~hole]
        output[hole] = any(surrounding_values)

    # Fill lakes
    land = output.astype(bool)
    lakes = binary_closing(land) & ~land
    labels, count = label(lakes)
    for idx in range(1, count + 1):
        lake = labels == idx
        output[lake] = lake.sum() < 6

    return output
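To see what the lake-detection step does in isolation, here is a small sketch on a made-up 7x7 map (the toy data is an assumption for illustration only). Closing the land mask fills small gaps, and pixels that are set after closing but were not land before are treated as lakes:

```python
import numpy as np
from scipy.ndimage import binary_closing

# Toy map: 1 = land, 0 = sea; a single sea pixel enclosed by land
arr = np.ones((7, 7))
arr[3, 3] = 0

land = arr.astype(bool)
# Closing = dilation followed by erosion; it fills small gaps in the land mask
lakes = binary_closing(land) & ~land

print(np.argwhere(lakes))
```

Only the enclosed pixel at (3, 3) is reported as a lake; fill_land() would then turn it into land because its size is below 6.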
Martin Valgur
  • The line: **imputed_array[hole] = mode(surrounding_values).mode[0]** gives the _AttributeError: 'tuple' object has no attribute 'mode'_ – Johannes-R-Schmid Jan 09 '17 at 15:42
  • Ok, seems that you have a different version of SciPy perhaps. You can use `collections.Counter` or any other method to find the most frequent value instead (see http://stackoverflow.com/a/28528632/2997179). I modified the code to use `Counter`. – Martin Valgur Jan 09 '17 at 15:49
  • I added sample code for your follow-up question. It's possible it won't behave as you want in some edge-cases you did not describe, so tweak it as you see fit. – Martin Valgur Jan 09 '17 at 16:55
  • I added it as a supplement to my question and described it better. I already solved the problem of the small enclosed seas, as it was a problem in the previous code. However, the remaining problem is the border between sea and lake, where the sea can't gain pixels. – Johannes-R-Schmid Jan 09 '17 at 16:58
1

I didn't find any library for this, so I wrote a function. If all the None values are in the middle of the array (not on its border), you can use this:

import numpy as np
from collections import Counter


def getModulusSurround(data):
    # Drop falsy entries (None, and also 0), then return the most frequent value
    tempdata = [x for x in data if x]
    c = Counter(tempdata)
    most_common = c.most_common(1)
    if most_common:
        return most_common[0][0]


def main():

    array = [[1, 2, 2, 4, 5],
             [2, 3, 4, 5, 6],
             [3, 4, None, 6, 7],
             [1, 4, 2, 3, 4],
             [4, 6, 2, 2, 4]]

    # The mix of ints and None gives an object array
    array = np.array(array, dtype=object)

    for i in range(5):
        for j in range(5):
            if array[i, j] is None:
                # 3x3 window around (i, j); assumes the cell is not on the border
                temparray = array[i-1:i+2, j-1:j+2]
                array[i, j] = getModulusSurround(temparray.flatten())

    print(array)

main()
0

After the incredible help of Martin Valgur, I have the result I need.

To get there, I added the lines marked with #! to Martin's code:

import numpy as np
from scipy.ndimage import label, binary_dilation
from scipy.stats import mode

def impute(arr):
    imputed_array = np.copy(arr)

    mask = np.isnan(arr)
    labels, count = label(mask)
    for idx in range(1, count + 1):
        hole = labels == idx
        surrounding_values = arr[binary_dilation(hole) & ~hole]

        # Ignore sea pixels (value 0) so that only land values are counted.
        # A boolean mask avoids removing items from a list while iterating
        # over it, which would skip elements.
        surrounding_values = surrounding_values[surrounding_values != 0]  #!

        # np.atleast_1d also handles SciPy versions where .mode is a scalar
        imputed_array[hole] = np.atleast_1d(mode(surrounding_values).mode)[0]  #!

    return imputed_array
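The core idea, taking the majority of the land values only, can also be tried on its own. The following is a minimal sketch using plain NumPy; `land_majority` is a hypothetical helper, and returning 0 when only sea is present is an assumption, not something stated above:

```python
import numpy as np

def land_majority(values):
    """Most frequent non-zero value (hypothetical helper, not part of SciPy)."""
    values = np.asarray(values, dtype=float)
    land = values[values != 0]            # drop sea pixels
    if land.size == 0:
        return 0.0                        # assumption: only sea around, so stay sea
    uniques, counts = np.unique(land, return_counts=True)
    return uniques[np.argmax(counts)]

# Five sea pixels and three land pixels around a "nan": land still wins
print(land_majority([0, 0, 0, 0, 0, 3, 3, 2]))
```

Even though sea is the overall majority in this example, the result is 3, the most frequent land value.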