4

Say that I have a NumPy array:

a = np.array([0, 1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 11, 12, 13, 13, 13, 14, 15])

And I have a length m = 2 that the user specifies in order to see if there are any repeats of that length within the time series. In this case, the repeats of length m = 2 are:

[2, 2]
[5, 5]
[9, 9]
[9, 9]
[13, 13]

And the user can change this to m = 3 and the repeats of length m = 3 are:

[9, 9, 9]
[13, 13, 13]

I need a function that either returns the index of where a repeat is found or None. So, for m = 3 the function would return the following NumPy array of starting indices:

[11, 17]

And for m = 4 the function would return None. What's the cleanest and fastest way to accomplish this?

Update Note that the array does not have to be sorted and we are not interested in the result after a sort. We only want the result from the unsorted array. Your result for m = 2 should be the same for this array:

b = np.array([0, 11, 2, 2, 3, 40, 5, 5, 16, 7, 80, 9, 9, 9, 1, 11, 12, 13, 13, 13, 4, 5])
slaw
  • 6,591
  • 16
  • 56
  • 109
  • Is `a` always sorted? – Divakar Jan 09 '20 at 11:18
  • 1
    And, would it be `(2, 6, 11, 12, 17, 18)` for `m=2`? – Divakar Jan 09 '20 at 11:22
  • Great questions. `a` is NEVER sorted and, yes, that is the returned array for `m = 2` – slaw Jan 09 '20 at 11:26
  • Can we sorted before? – Jose Macedo Jan 09 '20 at 11:30
  • Or it's impossible to sort the array? – Jose Macedo Jan 09 '20 at 11:31
  • It's not possible to sort but it's also not necessary. I'm not interested in repeats after a sort. I'm only interested in consecutive repeats as the array is. – slaw Jan 09 '20 at 11:36
  • Ok! So the solution that i posted should work! – Jose Macedo Jan 09 '20 at 11:41
  • @Divakar it looks like one of your previous solutions to a similar problem (https://stackoverflow.com/questions/58221268/count-number-of-repeated-elements-in-a-row-in-a-numpy-array) is close or at least somewhat related. In my case, I need (1) indices and (2) the ability to find repeats of a specific length – slaw Jan 09 '20 at 11:54
  • Does this answer your question? [find length of sequences of identical values in a numpy array (run length encoding)](https://stackoverflow.com/questions/1066758/find-length-of-sequences-of-identical-values-in-a-numpy-array-run-length-encodi) – Daniel F Jan 09 '20 at 12:19

6 Answers6

3

Approach #1

We could leverage 1D convolution for a vectorized solution -

def consec_repeat_starts(a, n):
    N = n-1
    m = a[:-1]==a[1:]
    return np.flatnonzero(np.convolve(m,np.ones(N, dtype=int))==N)-N+1

Sample runs -

In [286]: a
Out[286]: 
array([ 0,  1,  2,  2,  3,  4,  5,  5,  6,  7,  8,  9,  9,  9, 10, 11, 12,
       13, 13, 13, 14, 15])

In [287]: consec_repeat_starts(a, 2)
Out[287]: array([ 2,  6, 11, 12, 17, 18])

In [288]: consec_repeat_starts(a, 3)
Out[288]: array([11, 17])

In [289]: consec_repeat_starts(a, 4)
Out[289]: array([], dtype=int64)

Approach #2

We could also make use of binary-erosion -

from scipy.ndimage.morphology import binary_erosion

def consec_repeat_starts_v2(a, n):
    N = n-1
    m = a[:-1]==a[1:]
    return np.flatnonzero(binary_erosion(m,[1]*N))-(N//2)
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • 1
    Dang, @Divakar! Again, your solution(s) is simple, elegant, and I've learned a ton in only a few lines of code. Thank you! – slaw Jan 09 '20 at 13:55
  • do the answers matter if the input array are `np.float64` instead? – slaw Jan 09 '20 at 14:51
  • @slaw Shouldn't. It would look for exact duplicates, so with floating point math if you are closeness, you need to bring in tolerance when comparing. – Divakar Jan 09 '20 at 14:53
  • I should probably ask this in another question but after you get the indices, for each repeat sequence (starting at each index), is there a fast way to to then find all matches of that repeat sequence and return the index for those matches? So, if the array were `[0, 1, 2, 2, 3, 4, 2, 2, 5, 5, 6, 5, 5, 2, 2]`, `consec_repeat_starts` would return `[2, 6, 8, 11, 13]`. Then, for each unique type of repeat sequence (i.e., `[2, 2]` and `[5, 5]`) I want to return something like the repeat followed by the indices for where the repeat is located `([2, 2], [2, 6, 13]), ([5, 5], [8, 11])` – slaw Jan 09 '20 at 15:21
  • @slaw Yeah, that would involve some more work. Better to ask a new question, I think. – Divakar Jan 09 '20 at 15:23
  • Okay, I've reposted the question here if you wouldn't mind taking a look: https://stackoverflow.com/questions/59667326/finding-indices-for-repeat-sequences-in-numpy-array – slaw Jan 09 '20 at 15:32
0

I've come up with this solution, maybe it's not the cleanest.

I use a regular array instead of a numpy one, but it should work the same way.

a = [0, 1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 11, 12, 13, 13, 13, 14, 15]
m = 2


def index_repeated_length( list, length ) :
    previous = None
    count = 1
    for i, element in enumerate(list) :
        if element == previous :
            count += 1
        else :
            previous = element
            count = 1

        if count >= length :
            yield i - length + 1

print( list( index_repeated_length(a, m) ) )

Output:

[2, 6, 11, 12, 17, 18]

Since it's never sorted you have to iterate through it, so the complexity should be linear O(n)

Joan Puigcerver
  • 104
  • 1
  • 13
0

I found this one, quite simillar to the otherone posted, but it should work. It only work if the array is sorted

def find(a,m):
    index=[]
    prev=0
    n =1
    for i in range(1,len(a)):
        if a[prev] == a[i]:
            n+=1
            if(m==n):
                index.append(i-m+1)
                n=1
        else:
            n=1
        prev=i
    return index if len(index)>0 else None
find(a,3)
Jose Macedo
  • 308
  • 1
  • 11
0

If you just want to check for consecutive similar values in an unsorted array this method will still work:

import numpy as np

nums = np.array([11, 0, 1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 11, 12, 13, 13, 13, 14, 15])

m = 3

equals = []

for ii in range(len(nums)):
    if ii >= m:
        sliced = nums[ii-m:ii]
        if np.unique(sliced).size == 1:
             equals.append(sliced)

print(np.array(equals))

This will give the output for m=2:

[[ 2  2]
[ 5  5]
[ 9  9]
[ 9  9]
[13 13]
[13 13]]

This will give the output for m=3:

[[ 9  9  9]
[13 13 13]]
0

We can also use a recursive function:

def recursive_repeat(arr, k):
    if k == 1:
        return np.flatnonzero(arr)
    else:
        new_arr = np.zeros(arr.size - 1)
        mask = arr[1:] == arr[:-1]
        new_arr[mask] = arr[:-1][mask]
        return recursive_repeat(new_arr, k-1)


recursive_repeat(a, 2)
Out[]: array([ 2,  6, 11, 12, 17, 18], dtype=int64)

recursive_repeat(a, 3)
Out[]: array([11, 17], dtype=int64)
Daniel F
  • 13,620
  • 2
  • 29
  • 55
0

a numba njit'ed function:

import numpy as np
import numba as nb

@nb.njit
def find_repeats(arr, n):
    indices = []
    current = arr[0]
    count = 0
    for idx, item in enumerate(arr[1:]):
        if item == current:
            count += 1
            if count >= n-1:
                indices.append(idx)
        else:
            count = 0
            current = item
    return indices if indices else None

b = np.array([0, 11, 2, 2, 3, 40, 5, 5, 16, 7, 80, 9, 9, 9, 1, 11, 12, 13, 13, 13, 4, 5])

print(find_repeats(b, 2))
# [2, 6, 11, 12, 17, 18]

print(find_repeats(b, 3))
# [12, 18]

Less elegant but fast, especially for smaller array sizes - simple_benchmark comparison vs. Divakar's functions: benchmark

FObersteiner
  • 22,500
  • 8
  • 42
  • 72