Slicing along NumPy ndarray axis with varying indices

Question

Given a ndarray ar of shape (n, m) I want to "extract" subsequences along axis 1 of length k with k<m. In case of a known starting index start for the subsequences of length k this can be solved with new_ar = ar[:, start:end] (or just start:start+k).

However, what if I have a list start_list and an end_list of length n (or just the start_list, since the length of the subsequence is known anyway), which contains the starting indices (and ending indices) of the subsequences I want to extract? Intuitively I tried ar[:, start_list:end_list], but this throws TypeError: slice indices must be integers or None or have an __index__ method.

What would be a solution to this problem without the usage of loops and leveraging NumPys methods? For my problem the for-loop took 30 mins, but this has to have a NumPy-style 5ms solution since it's just indexing.

[edit]: Since the problem is probably better understood with code (thank you for the hints), I'll try to make it more compact what I want and show what I did to solve it with a loop.

I have an ndarray of shape (40450, 200000), representing 40450 signals of length 200000 each. The signals are shifted and I want to align them. So I want to extract subsequences of length say 190000 from each of the 40450 sequences. For this, I have a list start_list of length 40450, containing the starting indices for the subsequences (each of the 40450 subsequences I want to extract has a different starting point in the original sequence of length 200000).

I can solve this with a for loop (ar contains the original sequences, start_list the starting indices):

k = 190000
ar_new = np.zeros((40450, k))
for i in range(ar_new.shape[0]):
    ar_new[i] = ar[i, start_list[i]:start_list[i]+k]

If e. g. start_list[0] is 0, this means that I need ar[0, 0:190000], if start_list[10000] is 1337, this means I need ar[10000, 1337:1337+190000] etc.

But this takes >30 mins for my case and I am sure it can somehow be solved with NumPy built-in methods/some slicing magic.

Is the length `k` the same for all? If not it is definitely unavoidable to use loops, otherwise you'd get jagged arrays that are not supported by numpy. — flawr, Oct 13 '22 at 07:35
Could you show your code and a typical example of a list with starting points? — Vitalizzare, Oct 13 '22 at 08:13
Added code & an example in my [edit]. Thanks for hinting that I need to be more specific! — derkurt, Oct 13 '22 at 08:34

gboffi · Answer 1 · 2022-10-13T09:07:06.800

After some trials

In [14]: a = np.array(range(200000), dtype=float)
    ...: b = np.array(range(200000), dtype=float)
    ...: start, k = 100, 190000

In [15]: %timeit for _ in range(1000): a[:k] = a[s:s+k]
26.4 ms ± 9.04 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [16]: %timeit for _ in range(1000): b[:k] = a[s:s+k]
44.8 ms ± 902 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I am thinking ① if you can do without the non-aligned data, overwriting seems FASTER ② at any rate, if the process is contained in RAM, I'd expect to have my results in the range 1÷10 seconds, not 30 minutes ③ if your problem is swapping, overwriting avoids allocating approximately 4*4E4*2E5 ⇒ 32E9 bytes of memory.

Vitalizzare · Answer 2 · 2022-10-30T07:30:40.687

We can look at the original 2-dimentional array as at a 3-dimentional structure. With help of as_strided we can create a 3d-view to the array, where the first dimension is equal to the original one, the second dimension is for iterating over possible starting positions of a subrow, and the third one is for iterating over the values in a subrow:

from numpy.lib.stride_tricks import as_strided

# test data
n, m = 5, 10
arr = np.arange(n*m).reshape(n, m)
k = 5
start_list = [0, 1, 2, 1, 0]

# main code
n, m = arr.shape
isize = arr.dtype.itemsize
x = 1 + m - k    # a supporting intermediate dimension

assert k < m
assert len(start_list) == n
assert all(0 <= i < x for i in start_list)

# create a view to the original data with modified shape and strides
arr_modified = as_strided(arr, shape=(n,x,k), strides=(m*isize, isize, isize))

# from each row in arr select a k-length part 
# starting from the corresponding item in start_list
arr_new = arr_modified[range(n), start_list]

Slicing along NumPy ndarray axis with varying indices

2 Answers2