
Let's say we have the following data array:

data_array = np.array([[1, 1, 1], [1, 1, 2], [2, 2, 2], [3, 3, 3], [4, 4, 4]], np.int16)

data_array
array([[1, 1, 1],
       [1, 1, 2],
       [2, 2, 2],
       [3, 3, 3],
       [4, 4, 4]])

We want to mask the array according to the following ranges, so that a calculation can be applied to the unmasked part for each range:

intervals = [[1, 2], [2, 3], [3, 4]]

We first create a fully masked array with the same shape as the data array, so that the results for each interval can be combined into it:

init = np.zeros((data_array.shape[0], data_array.shape[1]))
result_array = np.ma.masked_where((init == 0), init)

result_array
masked_array(
data=[[--, --, --],
      [--, --, --],
      [--, --, --],
      [--, --, --],
      [--, --, --]],
mask=[[ True,  True,  True],
      [ True,  True,  True],
      [ True,  True,  True],
      [ True,  True,  True],
      [ True,  True,  True]]

With this we can start a for loop that masks the array according to the interval ranges, performs a calculation on the masked array, and combines the results into a single result array:

for inter in intervals:

    # Extract the start and end values of the interval range
    start_inter = inter[0]
    end_inter = inter[1]

    # Mask the array based on interval range
    mask_init = np.ma.masked_where((data_array > end_inter), data_array)
    masked_array = np.ma.masked_where((mask_init < start_inter), mask_init)

    # Perform a dummy calculation on masked array
    outcome = (masked_array + end_inter) * 100

    # Combine the outcome arrays
    result_array[result_array.mask] = outcome[result_array.mask]

With the following result:

array([[300.0, 300.0, 300.0],
      [300.0, 300.0, 400.0],
      [400.0, 400.0, 400.0],
      [600.0, 600.0, 600.0],
      [800.0, 800.0, 800.0]])

My question is: how can the same result be achieved without the for loop, that is, by applying the masking and the calculation to the whole data_array in a single operation? Note that the variables in the calculation change with each mask. Is a vectorized approach possible here? I would imagine numpy_indexed could be of some help. Thank you.

cf2
  • So you need each value in `data_array` to be affected only by the first interval that contains it? For example, in this case `2` is in both the first and the second interval (since, as you are defining them, your intervals are inclusive on both ends), but every `2` gets transformed into `400` because only the first interval is considered for it (otherwise I suppose you would get `900` from adding the results for the first and the second intervals). Is this a requirement? – jdehesa Feb 15 '19 at 12:50
  • Good point. This is indeed a requirement, so we do not add the outcomes if the value falls in multiple masks (in this case `2`). We just use the outcome of the first interval that contains the value (for `2` this is indeed `400`). In the real use case we could use non-overlapping intervals; for this example we could use `intervals = [[1, 1.9], [2, 2.9], [3, 4]]` – cf2 Feb 15 '19 at 13:03

1 Answer


If the intervals can be made non-overlapping, then you could use a function like this:

import numpy as np

def func(data_array, intervals):
    data_array = np.asarray(data_array)
    start, end = np.asarray(intervals).T
    data_array_exp = data_array[..., np.newaxis]
    mask = (data_array_exp >= start) & (data_array_exp <= end)
    return np.sum((data_array_exp + end) * mask * 100, axis=-1)

The result should be the same as with the original code in that case:

import numpy as np

def func_orig(data_array, intervals):
    init = np.zeros((data_array.shape[0], data_array.shape[1]))
    result_array = np.ma.masked_where((init == 0), init)
    for inter in intervals:
        start_inter = inter[0]
        end_inter = inter[1]
        mask_init = np.ma.masked_where((data_array > end_inter), data_array)
        masked_array = np.ma.masked_where((mask_init < start_inter), mask_init)
        outcome = (masked_array + end_inter) * 100
        result_array[result_array.mask] = outcome[result_array.mask]
    return result_array.data

data_array = np.array([[1, 1, 1], [1, 1, 2], [2, 2, 2], [3, 3, 3], [4, 4, 4]], np.int16)
intervals = [[1, 1.9], [2, 2.9], [3, 4]]
print(np.allclose(func(data_array, intervals), func_orig(data_array, intervals)))
# True
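
If the intervals do overlap, the sum in `func` would add up the contribution of every matching interval. The following is a minimal sketch, not part of the original answer, that keeps only the first matching interval, as the loop in the question does. The name `func_first_match` is hypothetical, and returning a masked array for values outside every interval is an assumption made to mirror the loop version:

```python
import numpy as np

def func_first_match(data_array, intervals):
    data_array = np.asarray(data_array)
    start, end = np.asarray(intervals).T
    # Broadcast every element against every interval: shape (..., n_intervals)
    data_exp = data_array[..., np.newaxis]
    mask = (data_exp >= start) & (data_exp <= end)
    # argmax on a boolean array returns the index of the first True
    first = mask.argmax(axis=-1)
    # Elements that fall in no interval at all
    matched = mask.any(axis=-1)
    # Same dummy calculation as in the question, using the chosen interval's end
    result = (data_array + end[first]) * 100.0
    return np.ma.masked_where(~matched, result)

data_array = np.array([[1, 1, 1], [1, 1, 2], [2, 2, 2], [3, 3, 3], [4, 4, 4]], np.int16)
intervals = [[1, 2], [2, 3], [3, 4]]
result = func_first_match(data_array, intervals)
```

This still builds the intermediate `(..., n_intervals)` boolean array, so it has the same memory footprint as `func`.
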
jdehesa
  • Did some thorough checking and works beautifully. With larger arrays it can get memory intensive, but with small arrays it's perfect. – cf2 Feb 18 '19 at 20:09
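
On the memory point in the comment above: `func` materializes an intermediate array with one extra axis of length `len(intervals)`. A possible lower-memory sketch, assuming the intervals are sorted by start value and non-overlapping, looks up one interval index per element with `np.searchsorted` instead. The name `func_searchsorted` is hypothetical, and values outside every interval are set to 0, matching what the sum in `func` produces:

```python
import numpy as np

def func_searchsorted(data_array, intervals):
    data_array = np.asarray(data_array)
    # Assumption: intervals are sorted by start and do not overlap
    start, end = np.asarray(intervals, dtype=float).T
    # Index of the last interval whose start is <= the value (-1 if none)
    idx = np.searchsorted(start, data_array, side="right") - 1
    # Clip so the index is always valid; validity is checked separately
    idx_safe = np.clip(idx, 0, len(start) - 1)
    inside = (idx >= 0) & (data_array <= end[idx_safe])
    # Same dummy calculation as in the question; 0 where no interval matches
    return np.where(inside, (data_array + end[idx_safe]) * 100, 0.0)

data_array = np.array([[1, 1, 1], [1, 1, 2], [2, 2, 2], [3, 3, 3], [4, 4, 4]], np.int16)
intervals = [[1, 1.9], [2, 2.9], [3, 4]]
result = func_searchsorted(data_array, intervals)
```

All intermediates here have the shape of `data_array`, so memory use no longer grows with the number of intervals.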