Finding Indices for Repeat Sequences in NumPy Array

Question

This is a follow-up to a previous question. If I have a NumPy array [0, 1, 2, 2, 3, 4, 2, 2, 5, 5, 6, 5, 5, 2, 2], for each repeat sequence (starting at each index), is there a fast way to then find all matches of that repeat sequence and return the indices of those matches?

Here, the repeat sequences are [2, 2] and [5, 5] (note that the repeat length is specified by the user, is the same for every repeat, and can be much greater than 2). The start indices of the repeats, [2, 6, 8, 11, 13], can be found via:

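import numpy as np

def consec_repeat_starts(a, n):
    N = n-1
    m = a[:-1]==a[1:]  # True where an element equals its successor
    # n equal elements in a row correspond to N = n-1 consecutive True values in m
    return np.flatnonzero(np.convolve(m,np.ones(N, dtype=int))==N)-N+1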
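
As a quick check of the function above, here is a minimal sketch that runs it on the example array. The second array `c` is a hypothetical input, added only to illustrate that overlapping runs are each reported when n = 3.

```python
import numpy as np

def consec_repeat_starts(a, n):
    N = n - 1
    m = a[:-1] == a[1:]  # True where an element equals its successor
    return np.flatnonzero(np.convolve(m, np.ones(N, dtype=int)) == N) - N + 1

a = np.array([0, 1, 2, 2, 3, 4, 2, 2, 5, 5, 6, 5, 5, 2, 2])
starts_2 = consec_repeat_starts(a, 2).tolist()
print(starts_2)  # [2, 6, 8, 11, 13]

# Hypothetical array: four 1s in a row contain [1, 1, 1] at index 0 and at index 1
c = np.array([1, 1, 1, 1, 2])
starts_3 = consec_repeat_starts(c, 3).tolist()
print(starts_3)  # [0, 1]
```

Note that `a` has to be a NumPy array here; with a plain Python list, `a[:-1] == a[1:]` compares the two lists as a whole instead of element by element.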

But for each unique type of repeat sequence (i.e., [2, 2] and [5, 5]) I want to return something like the repeat followed by the indices for where the repeat is located:

[([2, 2], [2, 6, 13]), ([5, 5], [8, 11])]

Update

Additionally, given the repeat sequences, can you return the results from a second array? That is, look for [2, 2] and [5, 5] in:

[2, 2, 5, 5, 1, 4, 9, 2, 5, 5, 0, 2, 2, 2]

And the function would return:

[([2, 2], [0, 11, 12]), ([5, 5], [2, 8])]

So, if we have `[0, 1, 2, 2, 2, 3, 4,..]`, would it be `[([2, 2], [2, 3, 6, 13]), ...`? — Divakar, Jan 09 '20 at 15:41
The query sequences (in the example `[2, 2]` and `[5, 5]`) are always the same number repeated twice (or more), or can they contain multiple different numbers? — jdehesa, Jan 09 '20 at 16:13
They will always be the same constant value within each repeat but always with the same length. For a different array for `m = 3`, the repeating sequences might be `[3.4, 3.4, 3.4], [10.1, 10.1, 10.1], [9.6, 9.6, 9.6]` — slaw, Jan 09 '20 at 16:19

Divakar · Accepted Answer · 2020-01-09T16:58:56.820

Here's a way to do so -

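def group_consec(a, n):
    idx = consec_repeat_starts(a, n)  # start index of every length-n repeat
    b = a[idx]                        # repeated value at each start
    # stable sort keeps the start indices ascending within each value
    sidx = b.argsort(kind='stable')
    c = b[sidx]
    # boundaries between groups of equal values in the sorted array
    cut_idx = np.flatnonzero(np.r_[True, c[:-1]!=c[1:],True])
    idx_s = idx[sidx]
    indices = [idx_s[i:j] for (i,j) in zip(cut_idx[:-1],cut_idx[1:])]
    return c[cut_idx[:-1]], indices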

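# Perform lookup in another array, b
n = 2
v_a,indices_a = group_consec(a, n)
v_b,indices_b = group_consec(b, n)

# Position of each repeated value of b within the sorted repeated values of a
idx = np.searchsorted(v_a, v_b)
idx[idx==len(v_a)] = 0  # clamp out-of-range positions; the mask below rejects them
valid_mask = v_a[idx]==v_b
# Keep only the repeats of b whose value also repeats in a
common_indices = [j for (i,j) in zip(valid_mask,indices_b) if i]
common_val = v_b[valid_mask]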

Note that, for simplicity and ease of use, the first output of group_consec holds the unique value of each sequence. If you need them in (val, val, ..) format, simply replicate them at the end. The same applies to common_val.
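
To make the replication step concrete, here is a minimal end-to-end sketch on the two arrays from the question. The helper `as_pairs` and the variable name `pos` are hypothetical names introduced only for this illustration, and `kind="stable"` is an assumption added so that the indices within each group come out in ascending order regardless of the sort implementation.

```python
import numpy as np

def consec_repeat_starts(a, n):
    # Start indices of every run of n equal consecutive values (from the question)
    N = n - 1
    m = a[:-1] == a[1:]
    return np.flatnonzero(np.convolve(m, np.ones(N, dtype=int)) == N) - N + 1

def group_consec(a, n):
    # Group the start indices by the value that is repeated there
    idx = consec_repeat_starts(a, n)
    vals = a[idx]
    sidx = vals.argsort(kind="stable")  # assumption: stable sort for ordered indices
    c = vals[sidx]
    cut_idx = np.flatnonzero(np.r_[True, c[:-1] != c[1:], True])
    idx_s = idx[sidx]
    indices = [idx_s[i:j] for i, j in zip(cut_idx[:-1], cut_idx[1:])]
    return c[cut_idx[:-1]], indices

def as_pairs(vals, indices, n):
    # Hypothetical helper: replicate each value n times and convert to plain lists
    return [([v] * n, ind.tolist()) for v, ind in zip(vals.tolist(), indices)]

n = 2
a = np.array([0, 1, 2, 2, 3, 4, 2, 2, 5, 5, 6, 5, 5, 2, 2])
b = np.array([2, 2, 5, 5, 1, 4, 9, 2, 5, 5, 0, 2, 2, 2])

v_a, indices_a = group_consec(a, n)
v_b, indices_b = group_consec(b, n)

# Keep only the repeats of b whose value also repeats in a
pos = np.searchsorted(v_a, v_b)
pos[pos == len(v_a)] = 0
valid_mask = v_a[pos] == v_b
common_indices = [j for keep, j in zip(valid_mask, indices_b) if keep]
common_val = v_b[valid_mask]

result_a = as_pairs(v_a, indices_a, n)
result_b = as_pairs(common_val, common_indices, n)
print(result_a)  # [([2, 2], [2, 6, 13]), ([5, 5], [8, 11])]
print(result_b)  # [([2, 2], [0, 11, 12]), ([5, 5], [2, 8])]
```

This sketch assumes both arrays contain at least one repeat of length n; an array with no repeats would need to be handled separately before grouping.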

You probably missed this in my comment above, but how would the function need to be changed in order to find repeated sequences in array `a` but return the indices of those repeated sequences in a second array, `b` (where `a` and `b` are different)? Perhaps `b` would be an optional argument to the function — slaw, Jan 09 '20 at 16:34
