Best way to determine if a sequence is in another sequence?

Question

This is a generalization of the "string contains substring" problem to (more) arbitrary types.

Given an sequence (such as a list or tuple), what's the best way of determining whether another sequence is inside it? As a bonus, it should return the index of the element where the subsequence starts:

Example usage (Sequence in Sequence):

>>> seq_in_seq([5,6],  [4,'a',3,5,6])
3
>>> seq_in_seq([5,7],  [4,'a',3,5,6])
-1 # or None, or whatever

So far, I just rely on brute force and it seems slow, ugly, and clumsy.

Does this answer your question? [Check for presence of a sliced list in Python](https://stackoverflow.com/questions/3313590/check-for-presence-of-a-sliced-list-in-python) — user202729, Dec 05 '21 at 18:08

Federico A. Ramponi · Accepted Answer · 2009-01-08T21:06:56.183

24

I second the Knuth-Morris-Pratt algorithm. By the way, your problem (and the KMP solution) is exactly recipe 5.13 in Python Cookbook 2nd edition. You can find the related code at http://code.activestate.com/recipes/117214/

It finds all the correct subsequences in a given sequence, and should be used as an iterator:

>>> for s in KnuthMorrisPratt([4,'a',3,5,6], [5,6]): print s
3
>>> for s in KnuthMorrisPratt([4,'a',3,5,6], [5,7]): print s
(nothing)

edited Jan 08 '09 at 21:06

answered Jan 08 '09 at 20:42

Federico A. Ramponi

46,145
29
109
133

3

Note that the KMP implementation given on code.activestate was demostrably slower by 30-500 times for some (perhaps unrepresentative input). Benchmarking to see if dumb built-in methods outperform seems to be a good idea! – James Brady Jan 09 '09 at 02:50
5

KMP is known to be about twice as slow as the naive algorithm in practice. Hence, for most purposes it’s *completely* inappropriate, despite its good asymptotic worst-case runtime. – Konrad Rudolph Oct 21 '10 at 12:51

score 12 · Answer 2 · edited May 23 '17 at 10:30

Here's a brute-force approach O(n*m) (similar to @mcella's answer). It might be faster than the Knuth-Morris-Pratt algorithm implementation in pure Python O(n+m) (see @Gregg Lind answer) for small input sequences.

#!/usr/bin/env python
def index(subseq, seq):
    """Return an index of `subseq`uence in the `seq`uence.

    Or `-1` if `subseq` is not a subsequence of the `seq`.

    The time complexity of the algorithm is O(n*m), where

        n, m = len(seq), len(subseq)

    >>> index([1,2], range(5))
    1
    >>> index(range(1, 6), range(5))
    -1
    >>> index(range(5), range(5))
    0
    >>> index([1,2], [0, 1, 0, 1, 2])
    3
    """
    i, n, m = -1, len(seq), len(subseq)
    try:
        while True:
            i = seq.index(subseq[0], i + 1, n - m + 1)
            if subseq == seq[i:i + m]:
               return i
    except ValueError:
        return -1

if __name__ == '__main__':
    import doctest; doctest.testmod()

I wonder how large is the small in this case?

score 9 · Answer 3 · answered Apr 03 '15 at 00:07

9

A simple approach: Convert to strings and rely on string matching.

Example using lists of strings:

 >>> f = ["foo", "bar", "baz"]
 >>> g = ["foo", "bar"]
 >>> ff = str(f).strip("[]")
 >>> gg = str(g).strip("[]")
 >>> gg in ff
 True

Example using tuples of strings:

>>> x = ("foo", "bar", "baz")
>>> y = ("bar", "baz")
>>> xx = str(x).strip("()")
>>> yy = str(y).strip("()")
>>> yy in xx
True

Example using lists of numbers:

>>> f = [1 , 2, 3, 4, 5, 6, 7]
>>> g = [4, 5, 6]
>>> ff = str(f).strip("[]")
>>> gg = str(g).strip("[]")
>>> gg in ff
True

answered Apr 03 '15 at 00:07

vexed

99
1
1

3

I like it! For quick & dirty stuff, anyway. Generally: ``def is_in(seq1, seq2): return str(list(seq1))[1:-1] in str(list(seq2))[1:-1]`` Not a good way to find the index of the match, I guess. – sfink Sep 21 '16 at 20:48
1

Little contrived, but this doesn't distinguish between adjacent vs combined elements so `["'a','b'"]` would show up in `["a","b"]` even though the term `'a','b'` doesn't exist in that list – KyleMit Jan 29 '23 at 20:51

score 6 · Answer 4 · answered Jan 08 '09 at 20:05

6

Same thing as string matching sir...Knuth-Morris-Pratt string matching

answered Jan 08 '09 at 20:05

nlucaroni

47,556
6
64
86

mcella · Answer 5 · 2009-01-08T20:33:46.110

>>> def seq_in_seq(subseq, seq):
...     while subseq[0] in seq:
...         index = seq.index(subseq[0])
...         if subseq == seq[index:index + len(subseq)]:
...             return index
...         else:
...             seq = seq[index + 1:]
...     else:
...         return -1
... 
>>> seq_in_seq([5,6], [4,'a',3,5,6])
3
>>> seq_in_seq([5,7], [4,'a',3,5,6])
-1

Sorry I'm not an algorithm expert, it's just the fastest thing my mind can think about at the moment, at least I think it looks nice (to me) and I had fun coding it. ;-)

Most probably it's the same thing your brute force approach is doing.

It is nice an clean, but brute-forcy --> O(mn) – Gregg Lind Jan 08 '09 at 20:57 — Gregg Lind, Jan 08 '09 at 20:57

score 2 · Answer 6 · answered Jan 08 '09 at 19:51

2

Brute force may be fine for small patterns.

For larger ones, look at the Aho-Corasick algorithm.

answered Jan 08 '09 at 19:51

Doug Currie

40,708
1
95
119

Aho-Corasick would be great. I'm specifically looking for python, or pythonish solutions... so if there were an implementation, that would be great. I'll poke around. – Gregg Lind Jan 08 '09 at 20:14

score 2 · Answer 7 · edited May 29 '12 at 18:27

Here is another KMP implementation:

from itertools import tee

def seq_in_seq(seq1,seq2):
    '''
    Return the index where seq1 appears in seq2, or -1 if 
    seq1 is not in seq2, using the Knuth-Morris-Pratt algorithm

    based heavily on code by Neale Pickett <neale@woozle.org>
    found at:  woozle.org/~neale/src/python/kmp.py

    >>> seq_in_seq(range(3),range(5))
    0
    >>> seq_in_seq(range(3)[-1:],range(5))
    2
    >>>seq_in_seq(range(6),range(5))
    -1
    '''
    def compute_prefix_function(p):
        m = len(p)
        pi = [0] * m
        k = 0
        for q in xrange(1, m):
            while k > 0 and p[k] != p[q]:
                k = pi[k - 1]
            if p[k] == p[q]:
                k = k + 1
            pi[q] = k
        return pi

    t,p = list(tee(seq2)[0]), list(tee(seq1)[0])
    m,n = len(p),len(t)
    pi = compute_prefix_function(p)
    q = 0
    for i in range(n):
        while q > 0 and p[q] != t[i]:
            q = pi[q - 1]
        if p[q] == t[i]:
            q = q + 1
        if q == m:
            return i - m + 1
    return -1

The `tee` calls don't seem to be good for anything since the other element in tee's output 2-tuple is ignored. `seq1` and `seq2` are each copied to two new generators, one of which gets instantiated into a list, and the other of which gets ignored. — j0057, Dec 14 '18 at 21:29

Michael Anderson · Answer 8 · 2016-10-27T23:37:48.267

1

I'm a bit late to the party, but here's something simple using strings:

>>> def seq_in_seq(sub, full):
...     f = ''.join([repr(d) for d in full]).replace("'", "")
...     s = ''.join([repr(d) for d in sub]).replace("'", "")
...     #return f.find(s) #<-- not reliable for finding indices in all cases
...     return s in f
...
>>> seq_in_seq([5,6], [4,'a',3,5,6])
True
>>> seq_in_seq([5,7], [4,'a',3,5,6])
False
>>> seq_in_seq([4,'abc',33], [4,'abc',33,5,6])
True

As noted by Ilya V. Schurov, the find method in this case will not return the correct indices with multi-character strings or multi-digit numbers.

edited Oct 27 '16 at 23:37

answered Oct 27 '16 at 20:21

Michael Anderson

149
4

This solution is not reliable in case elements of sequences have non-unique lenghs: it become not obvious how to translate index returned to index in initial sequences. Note also that backtick for `\`d\`` syntax is deprecated as for Python 3 and discouraged. – Ilya V. Schurov Oct 27 '16 at 21:29
example of non reliability even with all same sizes : sub='ab', full='aa','bb' – makapuf Jul 07 '17 at 15:31

score 0 · Answer 9 · answered Dec 30 '20 at 06:46

For what it's worth, I tried using a deque like so:

from collections import deque
from itertools import islice

def seq_in_seq(needle, haystack):
    """Generator of indices where needle is found in haystack."""
    needle = deque(needle)
    haystack = iter(haystack)  # Works with iterators/streams!
    length = len(needle)
    # Deque will automatically call deque.popleft() after deque.append()
    # with the `maxlen` set equal to the needle length.
    window = deque(islice(haystack, length), maxlen=length)
    if needle == window:
        yield 0  # Match at the start of the haystack.
    for index, value in enumerate(haystack, start=1):
        window.append(value)
        if needle == window:
            yield index

One advantage of the deque implementation is that it makes only a single linear pass over the haystack. So if the haystack is streaming then it will still work (unlike the solutions that rely on slicing).

The solution is still brute-force, O(n*m). Some simple local benchmarking showed it was ~100x slower than the C-implementation of string searching in str.index.

score -1 · Answer 10 · answered Dec 03 '15 at 06:36

-1

Another approach, using sets:

set([5,6])== set([5,6])&set([4,'a',3,5,6])
True

answered Dec 03 '15 at 06:36

miguelghz

375
3
6

Merely finds out whether the set is a subset of the sequence. Not whether it's actually in that order in the sequence. ``set([5,6])== set([5,6])&set([4,'a',5,4,6])`` returns ``True`` – Jonas Lindeløv Feb 14 '16 at 20:52
that could be a first, fast test however : check that all elements are in the full list. – makapuf Jul 07 '17 at 15:33
But it is not obvious that this “fast test” would be faster than the real test… At least asymptotically, it is slower: if m and n are the lengths of the small and large sequences, building sets and testing for inclusion is O(log(m+n)) whereas testing subsequence inclusion is O(m+n) with a clever algorithm like KMP. For small sequences, I also suspect that just the naive subsequence algorithm (O(m×n)) would be faster in practice than a subset test, as it builds no object and thus has no overhead. – Maëlan Feb 28 '23 at 10:30
Typo: I meant “building sets and testing for inclusion is **O((m+n)×log(m+n))**”. – Maëlan Jul 11 '23 at 00:21

Best way to determine if a sequence is in another sequence?

10 Answers10

Linked

Related