Faster way to find same integers in sequence in numpy array

Question

Right now I am just looping through using np.nditer() and comparing to the previous element. Is there a (vectorised) approach which is faster?

Added bonus is the fact that I don't always have to go to the end of the array; as soon as a sequence of max_len has been found I am done searching.

import numpy as np

max_len = 3
streak = 0
prev = np.nan

a = np.array([0, 3, 4, 3, 0, 2, 2, 2, 0, 2, 1])

for c in np.nditer(a):
  if c == prev:
      streak += 1
      if streak == max_len:
          print(c)
          break
  else:
      prev = c
      streak = 1

Alternative I thought about is using np.diff() but this just shifts the problem; we are now looking for a sequence of zeroes in its result. Also I doubt it will be faster since it will have to calculate the difference for every integer whereas in practice the sequence will occur before reaching the end of the list more often than not.

What's the variable you are interested in finally/ final expected output? — Divakar, Feb 11 '20 at 08:49
The array either contains a sequence or it doesn't. I'm interested in knowing whether a sequence occurs and if it does, what integer it is. — gosuto, Feb 11 '20 at 09:08

score 1 · Answer 1 · answered Feb 11 '20 at 08:58

You can use groupby from the itertools package.

import numpy as np
from itertools import groupby

max_len = 3
best = ()

a = np.array([0, 3, 4, 3, 0, 2, 2, 2, 0, 2, 1])

for k, g in groupby(a):
    tup_g = tuple(g)
    if tup_g==max_len:
        best = tup_g
        break
    if len(tup_g) > len(best):
        best = tup_g

best
# returns:
(2, 2, 2)

Divakar · Answer 2 · 2020-02-11T09:37:26.480

Assuming you are looking for the element that appears for at least max_len times consecutively, here's one NumPy based way -

m = np.r_[True,a[:-1]!=a[1:],True]
idx0 = np.flatnonzero(m)
m2 = np.diff(idx0)>=max_len
out = None # None for no such streak found case
if m2.any():
    out = a[idx0[m2.argmax()]]

Another with binary-dilation -

from scipy.ndimage.morphology import binary_erosion

m = np.r_[False,a[:-1]==a[1:]]
m2 = binary_erosion(m, np.ones(max_len-1, dtype=bool))
out = None
if m2.any():
    out = a[m2.argmax()]

Finally, for completeness, you can also look into numba. Your existing code would work as it is, with a direct-looping over a, i.e. for c in a:.

Xhattam · Answer 3 · 2020-02-11T09:27:47.460

You could create sub-arrays of length max_length, moving one position to the right each time (like ngrams), and check if the sum of one sub_array divided by max_length is equal to the first element of that sub-array.

If that's True, then you have found your consecutive sequence of integers of length max_length.

def get_conseq(array, max_length):
    sub_arrays = zip(*[array[i:] for i in range(max_length)])
    for e in sub_arrays:
        if sum(e) / len(e) == e[0]:
            print("Found : {}".format(e))
            return e
    print("Nothing found")
    return []

For example, this array [1,2,2,3,4,5], with max_length = 2, will be 'split' like this: [1,2] [2,2] [2,3] [3,4] [4,5]

On the second element, [2,2], the sum is 4, divided by max_length gives 2, and that matches the first element of that subgroup, and the function returns.

You can break if that's what you prefer to do, instead of returning like I do.

You could also add a few rules to capture edge cases, to make things clean (empty array, max_length superior to the length of the array, etc).

Here are a few example calls:

>>> splits([1,2,3,4,5,6], 2)
Nothing found

>>> splits([1,2,2,3,4,5,6], 3)
Nothing found

>>> splits([1,2,3,3,3], 3)
Found : [3, 3, 3]

>>> splits([1,2,2,3,3], 2)
Found : [2, 2]

Hope this helps !

senderle · Accepted Answer · 2020-02-12T11:05:40.463

I developed a numpy-only version that works, but after testing, I found that it performs quite poorly because it can't take advantage of short-circuiting. Since that's what you asked for, I describe it below. However, there is a much better approach using numba with a lightly modified version of your code. (Note that all of these return the index of the first match in a, rather than the value itself. I find that approach more flexible.)

@numba.jit(nopython=True)
def find_reps_numba(a, max_len):
    streak = 1
    val = a[0]
    for i in range(1, len(a)):
        if a[i] == val:
            streak += 1
            if streak >= max_len:
                return i - max_len + 1
        else:
            streak = 1
            val = a[i]
    return -1

This turns out to be ~100x faster than the pure Python version.

The numpy version uses the rolling window trick and the argmax trick. But again, this turns out to be far slower than even the pure Python version, by a substantial ~30x.

def rolling_window(a, window):
    a = numpy.ascontiguousarray(a)  # This approach requires a C-ordered array
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def find_reps_numpy(a, max_len):
    windows = rolling_window(a, max_len)
    return (windows == windows[:, 0:1]).sum(axis=1).argmax()

I tested both of these against a non-jitted version of the first function. (I used Jupyter's %%timeit feature for testing.)

a = numpy.random.randint(0, 100, 1000000)

%%timeit
find_reps_numpy(a, 3)
28.6 ms ± 553 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
find_reps_orig(a, 3)
4.04 ms ± 40.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
find_reps_numba(a, 3)
8.29 µs ± 89.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Note that these numbers can vary dramatically depending on how deep into a the functions have to search. For a better estimate of expected performance, we can regenerate a new set of random numbers each time, but it's difficult to do so without including that step in the timing. So for comparison here, I include the time required to generate the random array without running anything else:

a = numpy.random.randint(0, 100, 1000000)
9.91 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

a = numpy.random.randint(0, 100, 1000000)
find_reps_numpy(a, 3)
38.2 ms ± 453 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

a = numpy.random.randint(0, 100, 1000000)
find_reps_orig(a, 3)
13.7 ms ± 404 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

a = numpy.random.randint(0, 100, 1000000)
find_reps_numba(a, 3)
9.87 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As you can see, find_reps_numba is so fast that the variance in the time it takes to run numpy.random.randint(0, 100, 1000000) is much larger — hence the illusory speedup between the first and last tests.

So the big moral of the story is that numpy solutions aren't always best. Sometimes even pure Python is faster. In those cases, numba in nopython mode may be the best option by far.

Great answer, thank you. Confirms the fact that it is difficult to find a quicker way to do this. Also that `%%timeit` trick in IPython is a good one, the traditional method of using `timeit` is so tedious! — gosuto, Feb 11 '20 at 19:15
Really I should probably start using [perfplot](https://pypi.org/project/perfplot/), but timeit appeals to my lazy side too effectively. — senderle, Feb 11 '20 at 20:31

Faster way to find same integers in sequence in numpy array

4 Answers4