
For example, I have got an array like this:

([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5])

I need to find all duplicated sequences: not values, but sequences of at least two consecutive values.

The result should be like this:

of length 2: [1, 5] with indexes (0, 16);
of length 3: [3, 3, 7] with indexes (6, 12); [7, 9, 4] with indexes (2, 8)

Long sequences should be excluded if they are not duplicated: ([5, 5, 5, 5]) should NOT be taken as [5, 5] at indexes (0, 1, 2)! It's not a duplicated sequence, it's one long sequence.

I can do it with the pandas.apply function, but it is too slow; swifter did not help me.

And in real life I need to find all of them, with lengths from 10 up to 100 consecutive values, in a database with 1500 columns of 700,000 values each. So I really do need a vectorized solution.

Is there a vectorized solution for finding all of them at once? Or at least for finding only 10-value sequences? Or only 4-value sequences? Anything that is fully vectorized?

Anton Makarov
  • Do you have a maximum length for your sequences? Like, if there is any (not like in your example), do you also want the ones with 5 numbers? 10? – Ben.T May 27 '21 at 17:46
  • With that much data and the fact that you need to find all of them, it's almost a given that this is going to be slow. Quite a lot of memory usage as well. – Camilo Martinez M. May 27 '21 at 17:56
  • OK, I can let it be slow if it is at least vectorized. And we can be satisfied with vectorized finding of duplicates of one given length, for example, 10 consecutive values (it will automatically include parts of sequences with 11, 12 ... 100 values). – Anton Makarov May 27 '21 at 18:01

1 Answer


One possible implementation (although not fully vectorized) that finds all sequences of size n that appear more than once is the following:

import numpy as np

def repeated_sequences(arr, n):
    Na = arr.size
    r_seq = np.arange(n)
    # Build every length-n window of arr as a row: shape (Na - n + 1, n)
    n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq]
    # Deduplicate the windows
    unique_seqs = np.unique(n_seqs, axis=0)
    # M[i, j] is True when unique sequence i occurs at window j,
    # i.e. starting at index j of arr
    comp = n_seqs == unique_seqs[:, None]
    M = np.all(comp, axis=-1)

    if M.any():
        # Convolving a row of M with n ones smears each occurrence over
        # n positions, so a sequence that occurs only once yields exactly
        # n nonzeros and a repeated one yields more
        matches = np.array(
            [np.convolve(M[i], np.ones(n, dtype=int)) for i in range(M.shape[0])]
        )
        repeated_inds = np.count_nonzero(matches, axis=-1) > n
        repeated_matches = matches[repeated_inds]
        # For non-overlapping occurrences, every n-th nonzero index is
        # the start of an occurrence
        idxs = np.argwhere(repeated_matches > 0)[::n]
        # Group the start indices by sequence (the first column of
        # argwhere's output)
        grouped_idxs = np.split(
            idxs[:, 1], np.unique(idxs[:, 0], return_index=True)[1][1:]
        )
    else:
        return [], []

    return unique_seqs[repeated_inds], grouped_idxs
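
As a side note (not part of the original answer), here is a small sketch of why the convolution trick works: a row of M flags the windows where one particular sequence occurs, and convolving that row with np.ones(n) smears each start over n positions, so a single occurrence produces exactly n nonzeros and any repeat pushes the count above n.

>>> import numpy as np
>>> n = 3
>>> row = np.array([0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0])  # occurrences start at 2 and 8
>>> np.convolve(row, np.ones(n, dtype=int))
    array([0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0])
>>> np.count_nonzero(np.convolve(row, np.ones(n, dtype=int))) > n  # repeated?
    True
>>> np.flatnonzero(np.convolve(row, np.ones(n, dtype=int)))[::n]  # 1-D analogue of idxs[::n]
    array([2, 8])

The [::n] recovery of the start indices assumes the occurrences don't overlap; the fourth example in the EDIT below shows where that assumption breaks down.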

In theory, you could replace

matches = np.array(
    [np.convolve(M[i], np.ones(n, dtype=int)) for i in range(M.shape[0])]
)

with

import scipy.signal

matches = scipy.signal.convolve(
    M, np.ones((1, n), dtype=int), mode="full"
).astype(int)

which would make the whole thing "fully vectorized", but my tests showed that this was 3 to 4 times slower than the for-loop, so I'd stick with the loop. Or simply,

matches = np.apply_along_axis(np.convolve, -1, M, np.ones(n, dtype=int))

which does not give any significant speed-up, since it's basically a hidden loop (see this).

This is based on @Divakar's answer here, which dealt with a very similar problem in which the sequence to look for was provided. I simply made it follow this procedure for all possible sequences of size n, which are found inside the function with n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq]; unique_seqs = np.unique(n_seqs, axis=0).
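
To illustrate that intermediate step on the example array (a sketch of the shapes the function computes internally):

>>> a = np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5])
>>> n = 3
>>> n_seqs = a[np.arange(a.size - n + 1)[:, None] + np.arange(n)]
>>> n_seqs.shape  # one row per length-3 window
    (16, 3)
>>> n_seqs[:3]
    array([[1, 5, 7],
           [5, 7, 9],
           [7, 9, 4]])
>>> np.unique(n_seqs, axis=0).shape  # 14 unique windows, so two of them repeat
    (14, 3)

On NumPy >= 1.20, the same windows could also be built without the index matrix via np.lib.stride_tricks.sliding_window_view(a, n).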

For example,

>>> n = 3
>>> a = np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5])
>>> repeated_seqs, inds = repeated_sequences(a, n)
>>> for i, seq in enumerate(repeated_seqs[:10]):
...:    print(f"{seq} with indexes {inds[i]}")
...:
    [3 3 7] with indexes [ 6 12]
    [7 9 4] with indexes [2 8]

Disclaimer

Long sequences should be excluded if they are not duplicated: ([5, 5, 5, 5]) should NOT be taken as [5, 5] at indexes (0, 1, 2)! It's not a duplicated sequence, it's one long sequence.

This is not directly taken into account, and the sequence [5, 5] would appear more than once according to this algorithm. You could do something like the following, based on @Paul's answer here, but it involves a loop:

import numpy as np

repeated_matches = np.array([[0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0],
                             [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])

idxs = np.argwhere(repeated_matches > 0)
grouped_idxs = np.split(
    idxs[:, 1], np.unique(idxs[:, 0], return_index=True)[1][1:]
)

>>> print(grouped_idxs)
    [array([ 6,  7,  8, 12, 13, 14], dtype=int64), 
     array([ 7,  8,  9, 10], dtype=int64)]  

# If there are consecutive numbers in grouped_idxs, that means that there is a long 
# sequence that should be excluded. So, you'd have to check for consecutive numbers
filtered_idxs = []
for idx in grouped_idxs:
    # Keep a group only if its indices are not all consecutive
    if not np.all(np.diff(idx) == 1):
        filtered_idxs.append(idx)
        
>>> print(filtered_idxs)
    [array([ 6,  7,  8, 12, 13, 14], dtype=int64)]

Some tests:

>>> n = 3
>>> a = np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5])
>>> %timeit repeated_sequences(a, n)
    414 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> n = 4
>>> a = np.random.randint(0, 10, (10000,))
>>> %timeit repeated_sequences(a, n)
    3.88 s ± 54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> result, _ = repeated_sequences(a, n)
>>> result.shape
    (2637, 4)

This is far from the most efficient implementation, but it works as a 2D approach. Plus, if there aren't any repeated sequences, it returns empty lists.


EDIT: Full implementation

I vectorized the routine I added in the Disclaimer section as a possible solution to the long sequence problem and ended up with the following:

import numpy as np

# Taken from:
# https://stackoverflow.com/questions/53051560/stacking-numpy-arrays-of-different-length-using-padding
def stack_padding(it):
    # Zero-pad a list of uneven 1-D rows into a single 2-D matrix
    def resize(row, size):
        new = np.array(row)
        new.resize(size)
        return new

    row_length = len(max(it, key=len))
    mat = np.array([resize(row, row_length) for row in it])
    return mat
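
# For example (a sketch of what stack_padding does; these values are the
# grouped indices from the Disclaimer section):
#   stack_padding([np.array([6, 7, 8, 12, 13, 14]), np.array([7, 8, 9, 10])])
#   -> array([[ 6,  7,  8, 12, 13, 14],
#             [ 7,  8,  9, 10,  0,  0]])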

def repeated_sequences(arr, n):
    Na = arr.size
    r_seq = np.arange(n)
    n_seqs = arr[np.arange(Na - n + 1)[:, None] + r_seq]
    unique_seqs = np.unique(n_seqs, axis=0)
    comp = n_seqs == unique_seqs[:, None]
    M = np.all(comp, axis=-1)

    repeated_seqs = []
    idxs_repeated_seqs = []
    if M.any():
        matches = np.apply_along_axis(np.convolve, -1, M, np.ones(n, dtype=int))
        repeated_inds = np.count_nonzero(matches, axis=-1) > n

        if repeated_inds.any():
            repeated_matches = matches[repeated_inds]
            idxs = np.argwhere(repeated_matches > 0)
            grouped_idxs = np.split(
                idxs[:, 1], np.unique(idxs[:, 0], return_index=True)[1][1:]
            )

            # Additional routine
            # Pad this uneven array with zeros so that we can use it normally
            grouped_idxs = np.array(grouped_idxs, dtype=object)
            padded_idxs = stack_padding(grouped_idxs) 

            # Find the indices where there are padded zeros
            pad_positions = padded_idxs == 0 

            # Perform the "consecutive-numbers check" (this will take one
            # item off the original array, so we have to correct for its shape).
            idxs_to_remove = np.pad(
                (padded_idxs[:, 1:] - padded_idxs[:, :-1]) == 1,
                [(0, 0), (0, 1)],
                constant_values=True,
            )
            pad_positions = np.argwhere(pad_positions)
            i = pad_positions[:, 0]
            j = pad_positions[:, 1] - 1 # Shift by one (shape correction)
            idxs_to_remove[i, j] = True # Masking, since we don't want pad indices

            # Obtain a final mask (boolean opposite of indices to remove)
            final_mask = ~idxs_to_remove.all(axis=-1)
            grouped_idxs = grouped_idxs[final_mask] # Filter the long sequences

            repeated_seqs = unique_seqs[repeated_inds][final_mask]

            # In order to get the correct indices, we must first limit the
            # search to a shape (on axis=1) that is the largest multiple of
            # n that fits. This avoids taking more indices than we should
            # to show where each repeated sequence begins.
            to = (padded_idxs.shape[1] // n) * n

            # Build the final list of indices (from 0 to `to` with a step
            # of n)
            idxs_repeated_seqs = [
                grouped_idxs[i][:to:n] for i in range(grouped_idxs.shape[0])
            ]

    return repeated_seqs, idxs_repeated_seqs

For example,

n = 2

examples = [
    # First example is your original example array.
    np.array([1, 5, 7, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5]),

    # Second example has a long sequence of 5's, and since there aren't
    # any [5, 5] anywhere else, it's not taken into account and therefore
    # should not come out.
    np.array([1, 5, 5, 5, 5, 6, 3, 3, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5]),

    # Third example has the same long sequence but since there is a [5, 5]
    # later, then it should take it into account and this sequence should
    # be found.
    np.array([1, 5, 5, 5, 5, 6, 5, 5, 7, 9, 4, 0, 3, 3, 7, 8, 1, 5]),

    # Fourth example has a [5, 5] first and, later, a long sequence of
    # 5's of odd length; the previous implementation got confused about
    # which starting indices to show. In this case, they should be
    # 1, 13 and 15 for [5, 5].
    np.array([1, 5, 5, 9, 4, 6, 3, 3, 7, 9, 4, 0, 3, 5, 5, 5, 5, 5]),
]

for a in examples:
    print(f"\nExample: {a}")
    repeated_seqs, inds = repeated_sequences(a, n)
    for i, seq in enumerate(repeated_seqs):
        print(f"\t{seq} with indexes {inds[i]}")

Output (as expected):

Example: [1 5 7 9 4 6 3 3 7 9 4 0 3 3 7 8 1 5]
    [1 5] with indexes [0 16]
    [3 3] with indexes [6 12]
    [3 7] with indexes [7 13]
    [7 9] with indexes [2 8]
    [9 4] with indexes [3 9]

Example: [1 5 5 5 5 6 3 3 7 9 4 0 3 3 7 8 1 5]
    [1 5] with indexes [0 16]
    [3 3] with indexes [6 12]
    [3 7] with indexes [7 13]

Example: [1 5 5 5 5 6 5 5 7 9 4 0 3 3 7 8 1 5]
    [1 5] with indexes [ 0 16]
    [5 5] with indexes [1 3 6]

Example: [1 5 5 9 4 6 3 3 7 9 4 0 3 5 5 5 5 5]
    [5 5] with indexes [ 1 13 15]
    [9 4] with indexes [3 9]
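
And in the no-duplicates case, the function does return empty lists, e.g. (a quick check, not part of the original examples):

>>> repeated_sequences(np.array([1, 2, 3, 4, 5]), 2)
    ([], [])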

You can test it out yourself with more examples and more cases. Keep in mind this is what I understood from your disclaimer. If you want to count the long sequences as one, even if multiple sequences are in there (for example, [5, 5] appears twice in [5, 5, 5, 5]), this won't work for you and you'd have to come up with something else.

Camilo Martinez M.
  • comp = n_seqs == unique_seqs[:, None] in `repeated_sequences` causes a "DeprecationWarning: elementwise comparison failed; this will raise an error in the future" and stops calculating. A typo? – Anton Makarov May 28 '21 at 09:27
  • I found out it was a lack of memory on my notebook. Chopping a single array down to 10,000 values leads to another error: "[np.convolve(M[i], np.ones((n), dtype=int)) for i in range(M.shape[0])]" raises "ValueError: object too deep for desired array" – Anton Makarov May 28 '21 at 10:15
  • 1
  • That warning seems to appear whenever no repeated sequences are found, in which case the element-wise comparison fails. You could maybe change the if condition to test for something else, instead of `M`, to avoid the warning. This is the case where empty lists are returned. Regarding the `ValueError`, you have to make sure you're providing an array of the correct shape to the function; it must be strictly 1D. That error is telling you that even after indexing `M`, `M[i]` doesn't turn out to be 1D and is therefore too deep for numpy's convolve, which only accepts 1D arrays. – Camilo Martinez M. May 28 '21 at 14:01
  • @AntonMakarov, see my edit for the long sequences problem. – Camilo Martinez M. May 29 '21 at 01:37