
I'm working with large data sets. I'm trying to use NumPy where I can, or Python features such as list comprehensions, to process them efficiently.

First I find the relevant indexes:

dt_temp_idx = np.where(dt_diff > dt_temp_th)

Then I want to create a mask that contains, for each index, a sequence running from that index up to a stop value. I tried:

mask_dt_temp = [np.arange(idx, idx+dt_temp_step) for idx in dt_temp_idx]

and:

  mask_dt_temp = [idxs for idx in dt_temp_idx for idxs in np.arange(idx, idx+dt_temp_step)]

but it gives me the exception:

The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Example input:

indexes = [0, 100, 1000]

Example output, where each sequence stops 10 integers after its index:

list = [0, 1, ..., 10, 100, 101, ..., 110, 1000, 1001, ..., 1010]

1) How can I solve this? 2) Is this the best way to do it?

Jonathan Hall
Luca
  • What exactly is the end-goal? How do you plan to use the mask or the range of indices? – Divakar Feb 11 '20 at 14:57
  • I will use the mask to drop\delete entries from an array. – Luca Feb 11 '20 at 14:59
  • What if some sequences overlap? For example for an input of `[0, 5, 100]`, is the expected output list `[0, 1, ..., 15, 100, ..., 110]`? – Serge Ballesta Feb 11 '20 at 15:03
  • I don't think any sequences overlap. On the contrary your suggested expected output may be fine for me. – Luca Feb 11 '20 at 15:07
  • Does this answer your question? [ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()](https://stackoverflow.com/questions/10062954/valueerror-the-truth-value-of-an-array-with-more-than-one-element-is-ambiguous) – AMC Feb 11 '20 at 19:26
  • From the docs for [`numpy.where()`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html): _When only condition is provided, this function is a shorthand for np.asarray(condition).nonzero(). Using nonzero directly should be preferred, as it behaves correctly for subclasses._ – AMC Feb 11 '20 at 19:27

2 Answers

1

Masks (boolean arrays) are both memory-efficient and fast. We will use SciPy's binary dilation to extend the thresholded mask.

Here's a step-by-step setup and solution run:

In [42]: # Random data setup
    ...: import numpy as np
    ...: np.random.seed(0)
    ...: dt_diff = np.random.rand(20)
    ...: dt_temp_th = 0.9

In [43]: # Get mask of threshold crossings
    ...: mask = dt_diff > dt_temp_th

In [44]: mask
Out[44]: 
array([False, False, False, False, False, False, False, False,  True,
       False, False, False, False,  True, False, False, False, False,
       False, False])

In [45]: W = 3 # window size for extension (edit it according to your use-case)

In [46]: from scipy.ndimage import binary_dilation  # scipy.ndimage.morphology is a deprecated namespace

In [47]: extm = binary_dilation(mask, np.ones(W, dtype=bool), origin=-(W//2))

In [48]: mask
Out[48]: 
array([False, False, False, False, False, False, False, False,  True,
       False, False, False, False,  True, False, False, False, False,
       False, False])

In [49]: extm
Out[49]: 
array([False, False, False, False, False, False, False, False,  True,
        True,  True, False, False,  True,  True,  True, False, False,
       False, False])

Compare mask against extm to see how the extension takes place.

As we can see, each True in the thresholded mask is extended to the right into a run of W elements, which gives the expected output mask extm. With boolean indexing, dt_diff[~extm] drops those elements from the input array, and dt_diff[extm] selects them instead.
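To make the extension concrete, here is a minimal hand-written sketch of the same idea with a plain loop instead of binary dilation. The helper name extend_mask_right and the integer array data are hypothetical, introduced only for illustration; the True positions (8 and 13) are the ones in the mask shown above.

```python
import numpy as np

def extend_mask_right(mask, width):
    """Return a mask where each True in `mask` starts a run of `width` Trues."""
    mask = np.asarray(mask, dtype=bool)
    out = np.zeros_like(mask)
    for start in np.flatnonzero(mask):
        # A slice that runs past the end is clipped to the array length.
        out[start:start + width] = True
    return out

mask = np.zeros(20, dtype=bool)
mask[[8, 13]] = True               # same True positions as the mask above
extm = extend_mask_right(mask, 3)

data = np.arange(20)               # hypothetical stand-in for dt_diff
kept = data[~extm]                 # entries outside the extended windows
dropped = data[extm]               # entries inside the extended windows
```

The loop runs once per threshold crossing, so it should stay cheap when crossings are rare, but the vectorised versions are likely to be preferable when there are many of them.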

Alternatives with NumPy based functions

Alternative #1

extm = np.convolve(mask, np.ones(W, dtype=int))[:len(dt_diff)]>0
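The convolution works because each True adds 1 to its own position and to the next W-1 positions, so any position with a non-zero count lies inside some window. A small sketch on the question's example input, assuming an array length of 1020 (the variable n is hypothetical) and a step of 11 so that each window covers idx through idx+10:

```python
import numpy as np

indexes = [0, 100, 1000]     # example input from the question
step = 11                    # idx .. idx+10 inclusive
n = 1020                     # hypothetical array length

mask = np.zeros(n, dtype=bool)
mask[indexes] = True

# Full convolution has length n + step - 1; keep only the first n entries.
counts = np.convolve(mask.astype(int), np.ones(step, dtype=int))[:n]
extm = counts > 0

selected = np.flatnonzero(extm)   # indices covered by any window
```

Here counts also tells you how many windows cover each position, which can help when checking whether windows overlap.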

Alternative #2

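# Indices where the threshold is crossed
idx = np.flatnonzero(mask)
# Broadcast each index against 0..W-1 to list every index in every window
ext_idx = (idx[:, None] + np.arange(W)).ravel()

# Keep-mask: True for entries to keep, False inside the extended windows
ext_mask = np.ones(len(dt_diff), dtype=bool)
ext_mask[ext_idx[ext_idx < len(dt_diff)]] = False

# Get filtered o/p
out = dt_diff[ext_mask]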
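As a rough check of how Alternative #2 behaves at the edges, the sketch below uses made-up indices where two windows overlap and one runs past the end of the array. The names data, windows, flat and keep are hypothetical.

```python
import numpy as np

data = np.arange(120)            # hypothetical data
idx = np.array([0, 5, 115])      # first two windows overlap, last one overruns
W = 11

windows = idx[:, None] + np.arange(W)   # one row of W indices per start index
flat = windows.ravel()
flat = flat[flat < len(data)]           # discard indices beyond the array

keep = np.ones(len(data), dtype=bool)
keep[flat] = False                      # repeated indices are harmless here

out = data[keep]
```

Overlaps need no special handling with this approach, since assigning False to the same position twice has no extra effect.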
Divakar
  • Do you mind helping me writing the same algorithm without using any high-level function? I would like to do it "by hand" at low-level to learn more – Luca Feb 11 '20 at 15:23
  • @Luca Do you want to drop or select those from the extended mask? – Divakar Feb 11 '20 at 15:25
  • I will use the mask to select the entries from an array and drop them. They are considered invalid data to me, I need to detect and remove them – Luca Feb 11 '20 at 15:27
  • @Luca Check out just added `Alternative #1`. – Divakar Feb 11 '20 at 15:31
0

np.where called with only a condition returns a tuple of index arrays, so the indices themselves are in dt_temp_idx[0]. That NumPy array is still a Python iterable, so you can use a good old Python list comprehension:

lst = [i for j in dt_temp_idx[0] for i in range(j, j+11)]

If you want to cope with sequence overlaps and turn the result back into a np.array, build a set and sort it:

result = np.array(sorted({i for j in dt_temp_idx[0] for i in range(j, j+11)}))

But beware: a set is robust and guarantees no repetition, but it can be more expensive than a simple list.
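For reference, a minimal NumPy-only sketch of the same idea. It shows why iterating over the result of np.where directly fails (the result is a tuple holding one index array) and uses np.unique to remove duplicates from overlapping windows. The data values are made up; np.flatnonzero and np.delete are swapped in here and are not part of the answer above.

```python
import numpy as np

dt_diff = np.array([0.95, 0.1, 0.2, 0.99, 0.3, 0.4, 0.5, 0.6])  # hypothetical data
dt_temp_th = 0.9
dt_temp_step = 3

where_result = np.where(dt_diff > dt_temp_th)   # a 1-tuple holding the index array
idx = np.flatnonzero(dt_diff > dt_temp_th)      # the plain 1-D index array

# np.unique flattens, sorts and de-duplicates the window indices.
ext_idx = np.unique(idx[:, None] + np.arange(dt_temp_step))
ext_idx = ext_idx[ext_idx < len(dt_diff)]       # stay inside the array

cleaned = np.delete(dt_diff, ext_idx)           # drop the invalid entries
```

Looping over where_result yields the whole index array as a single item, which is probably what triggered the "truth value of an array" error in the question's np.arange call.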

Serge Ballesta