Fastest way to check if a sequence contains a non-consecutive subsequence?

Question

Let's say that two lists of elements are given, A and B. I'm interested in checking if A contains all the elements of B. Specifically, the elements must appear in the same order and they do not need to be consecutive. If this is the case, we say that B is a subsequence of A.

Here are some examples:

A = [4, 2, 8, 2, 7, 0, 1, 5, 3]
B = [2, 2, 1, 3]
is_subsequence(A, B) # True

A = [4, 2, 8, 2, 7, 0, 1, 5, 3]
B = [2, 8, 2]
is_subsequence(A, B) # True

A = [4, 2, 8, 2, 7, 0, 1, 5, 3]
B = [2, 1, 6]
is_subsequence(A, B) # False

A = [4, 2, 8, 2, 7, 0, 1, 5, 3]
B = [2, 7, 2]
is_subsequence(A, B) # False

I found a very elegant way to solve this problem (see this answer):

def is_subsequence(A, B):
    it = iter(A)
    return all(x in it for x in B)

I am now wondering how this solution behaves with possibly very very large inputs. Let's say that my lists contain billions of numbers.

What's the complexity of the code above? What's its worst case? I have tried to test it with very large random inputs, but its speed mostly depends on the automatically generated input.
Most importantly, are there more efficient solutions? Why are these solutions more efficient than this one?

@Austin that doesn't preserve the order requirement. I believe this has already been answered (although not in python): https://stackoverflow.com/questions/33174985/how-do-you-check-if-one-array-is-a-subsequence-of-another — sshashank124, Dec 28 '19 at 15:10
@sshashank124 I've read that answer already, and I'm indeed trying to implement those algorithms in Python to check whether they're faster than this approach. Still I am not sure about the complexity of this method (especially the worst case). — Riccardo Bucco, Dec 28 '19 at 15:12
@sshashank124: the `all(x in it for x in B)` approach is essentially Blinkenlight's algorithm, in more compact form. — Martijn Pieters, Dec 28 '19 at 15:57
If the bigger list is fixed and being searched repeatedly then I would make it a map and lookup that. — Salim, Dec 29 '19 at 03:12

Martijn Pieters · Accepted Answer · 2019-12-28T15:53:07.907

The code you found creates an iterator for A; you can see this as a simple pointer to the next position in A to look at, and in moves the pointer forward across A until a match is found. It can be used multiple times, but only ever moves forward; when using in containment tests against a single iterator multiple times, the iterator can't go backwards and so can only test if still to visit values are equal to the left-hand operand.

Given your last example, with B = [2, 7, 2], what happens is this:

it = iter(A) creates an iterator object for the A list, and stores 0 as the next position to look at.
The all() function tests each element in an iterable and returns False early, if such a result was found. Otherwise it keeps testing every element. Here the tests are repeated x in it calls, where x is set to each value in B in turn.
x is first set to 2, and so 2 in it is tested.
- it is set to next look at A[0]. That's 4, not equal to 2, so the internal position counter is incremented to 1.
- A[1] is 2, and that's equal, so 2 in it returns True at this point, but not before incrementing the 'next position to look at' counter to 2.
2 in it was true, so all() continues on.
The next value in B is 7, so 7 in it is tested.
- it is set to next look at A[2]. That's 8, not 7. The position counter is incremented to 3.
- it is set to next look at A[3]. That's 2, not 7. The position counter is incremented to 4.
- it is set to next look at A[4]. That's 7, equal to 7. The position counter is incremented to 5 and True is returned.
7 in it was true, so all() continues on.
The next value in B is 2, so 2 in it is tested.
- it is set to next look at A[5]. That's 0, not 2. The position counter is incremented to 6.
- it is set to next look at A[6]. That's 1, not 2. The position counter is incremented to 7.
- it is set to next look at A[7]. That's 5, not 2. The position counter is incremented to 8.
- it is set to next look at A[8]. That's 3, not 2. The position counter is incremented to 9.
- There is no A[9] because there are not that many elements in A, and so False is returned.
2 in it was False, so all() ends by returning False.

You could verify this with an iterator with a side effect you can observe; here I used print() to write out what the next value is for a given input:

>>> A = [4, 2, 8, 2, 7, 0, 1, 5, 3]
>>> B = [2, 7, 2]
>>> with_sideeffect = lambda name, iterable: (
    print(f"{name}[{idx}] = {value}") or value
    for idx, value in enumerate(iterable)
)
>>> is_sublist(with_sideeffect("  > A", A), with_sideeffect("< B", B))
< B[0] = 2
  > A[0] = 4
  > A[1] = 2
< B[1] = 7
  > A[2] = 8
  > A[3] = 2
  > A[4] = 7
< B[2] = 2
  > A[5] = 0
  > A[6] = 1
  > A[7] = 5
  > A[8] = 3
False

Your problem requires that you test every element of B consecutively, there are no shortcuts here. You also must scan through A to test for the elements of B being present, in the right order. You can only declare victory when all elements of B have been found (partial scan), and defeat when all elements in A have been scanned and the current value in B you are testing for is not found.

So, assuming the size of B is always smaller than A, the best case scenario then is where all K elements in B are equal to the first K elements of A. The worst case, is any case where not all of the elements of B are present in A, and require a full scan through A. It doesn't matter what number of elements are present in B; if you are testing element K out of K you already have been scanning part-way through A and must complete your scan through A to find that the last element is missing.

So the best case with N elements in A and K elements in B, takes O(K) time. The worst case, using the same definitions of N and K, takes O(N) time.

There is no faster algorithm to test for this condition, so all you can hope for is lowering your constant times (the time taken to complete each of the N steps). Here that'd be a faster way to scan through A as you search for the elements of B. I am not aware of a better way to do this than by using the method you already found.

@RiccardoBucco: O(N + K) is the same thing as O(N). At most, as K approaches N, you'd have O(2N), so a constant, and constants are simply dropped. Asymptotically speaking, there is no difference between *(value close to infinity) + infinity*, and simply *infinity*. — Martijn Pieters, Dec 28 '19 at 15:46

Fastest way to check if a sequence contains a non-consecutive subsequence?

1 Answers1