24

How to test if one string is a subsequence of another?

This is a weaker condition than being a substring. For example 'iran' is not a substring of 'ireland', but it is a subsequence IRelANd. The difference is a subsequence doesn't have to be contiguous.

More examples:

  • 'indonesia' contains 'india'. INDonesIA
  • 'romania' contains 'oman'. rOMANia
  • 'malawi' contains 'mali'. MALawI

Movitation: My friends like word games. Yesterday we played 'countries within countries'. I am curious if there are any pairs we missed.

Edit: If you aren't familiar with the mathematical definition of subsequence

A subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements

Colonel Panic
  • 132,665
  • 89
  • 401
  • 465
  • 1
    @wim The deduplication direction looks wrong to me. This one is the older question and has more views and more votes than that other question that this is supposed to be a duplicate of. – Stefan Pochmann Sep 14 '21 at 11:53
  • @StefanPochmann Took another look, it's a close call but I still think the Q&A on other one are better/clearer. Correctness & conciseness trumps age/views/votes. But if the top answer here were edited to remove extra noise, I wouldn't lose any sleep over the duplicate direction getting swapped. – wim Sep 14 '21 at 19:00

4 Answers4

52
def is_subseq(x, y):
    it = iter(y)
    return all(any(c == ch for c in it) for ch in x)

assert is_subseq('india', 'indonesia')
assert is_subseq('oman', 'romania')
assert is_subseq('mali', 'malawi')
assert not is_subseq('mali', 'banana')
assert not is_subseq('ais', 'indonesia')
assert not is_subseq('ca', 'abc')

Also works for any iterables:

assert is_subseq(['i', 'n', 'd', 'i', 'a'],
                 ['i', 'n', 'd', 'o', 'n', 'e', 's', 'i', 'a'])

UPDATE

Stefan Pochmann suggested this.

def is_subseq(x, y):
    it = iter(y)
    return all(c in it for c in x)

Both versions uses iterators; Iterator yields items that was not yielded in previous iteration.

For example:

>>> it = iter([1,2,3,4])
>>> for x in it:
...     print(x)
...     break
...
1
>>> for x in it:  # `1` is yielded in previous iteration. It's not yielded here.
...     print(x)
...
2
3
4
falsetru
  • 357,413
  • 63
  • 732
  • 636
  • What about 'ca' in 'abc'? – user189 Jun 03 '14 at 14:27
  • 1
    @user189, According to the quote in the question, **order** is important; `ca` is not a subsequence of `abc`. – falsetru Jun 03 '14 at 14:28
  • 3
    Ok, misunderstood the use of an iterator here, seems quite clever. – user189 Jun 03 '14 at 14:30
  • 11
    What do you think of `all(c in it for c in x)`? Does something speak against it? – Stefan Pochmann Jul 05 '16 at 10:43
  • @StefanPochmann, Nice. I updated the answer to include your version. Thank you for your suggestion. – falsetru Jul 05 '16 at 11:22
  • @falsetru Ok, I guess that means you see nothing against it :-). I find it a bit more magical and would like to see how it's implemented. [This section](https://docs.python.org/3/reference/expressions.html#membership-test-operations) looks relevant, but doesn't mention iterators. But I guess it'll be like **"For user-defined classes which do not define `__contains__()` but do define `__iter__()`, `x in y` is true if some value z with x == z is produced while iterating over y"** and thus do the same as your original. Is that right? – Stefan Pochmann Jul 05 '16 at 16:15
  • @StefanPochmann, Yes, it is. – falsetru Jul 05 '16 at 16:32
  • @StefanPochmann can you explain how 'ca' in 'abc' doesn't pass with your implementation? `'c' in 'abc' and 'a' in 'abc'`is true – user2361174 Feb 10 '17 at 00:21
  • @user2361174 What do you mean it doesn't pass my implementation? You mean why 'ca' isn't determined to be a subsequence of 'abc'? Because it **isn't** a subsequence of that. – Stefan Pochmann Feb 10 '17 at 06:35
  • @StefanPochmann How does `'ac' in 'abc'` manages to pass in your code but `'ca' in 'abc'` fails? – user2361174 Feb 11 '17 at 03:40
  • @user2361174 'ca' fails because when the iterator finds the 'c', it's already at the end of the string. There's nothing left. Particularly no 'a'. – Stefan Pochmann Feb 11 '17 at 07:27
  • 1
    @user2361174, `it` in the answer is an iterator. Item that is iterated is not iterated again. (consumed) – falsetru Feb 11 '17 at 07:44
  • @StefanPochmann Can you explain why need `it = iter(y)`: 1. why we need to convert y to an iterator 2. If we need, why we do not need to convert x to an iterator? – cxs1031 May 04 '17 at 19:31
  • 1
    Just brilliant!! took a while to grok this, but I love it – RoyM Jan 23 '21 at 20:33
  • The performance of this approach when using strings is quite poor due to iterating every character in a Python loop. I've added some timings in the accepted answer. – wim Sep 14 '21 at 19:46
7

Just keep on looking for the next character of your potential subsequence, starting behind the last found one. As soon as one of the characters can't be found in the remainder of the string, it's no subsequence. If all characters can be found this way, it is:

def is_subsequence(needle, haystack):
    current_pos = 0
    for c in needle:
        current_pos = haystack.find(c, current_pos) + 1
        if current_pos == 0:
            return False
    return True

An advantage over the top answer is that not every character needs to be iterated in Python here, since haystack.find(c, current_pos) is looping in C code. So this approach can perform significantly better in the case that the needle is small and the haystack is large:

>>> needle = "needle"
>>> haystack = "haystack" * 1000
>>> %timeit is_subseq(needle, haystack)
296 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit is_subsequence(needle, haystack)
334 ns ± 1.51 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
wim
  • 338,267
  • 99
  • 616
  • 750
KillianDS
  • 16,936
  • 4
  • 61
  • 70
0

I did

def is_subsequence(x, y):
    """Test whether x is a subsequence of y"""
    x = list(x)
    for letter in y:
        if x and x[0] == letter:
            x.pop(0)

    return not x
Colonel Panic
  • 132,665
  • 89
  • 401
  • 465
  • 7
    Deleting from the left of a list is awfully inefficient, especially if you do it in a loop. And it's completely unnecessary here; you could/should just use an iterator instead. – Aran-Fey Oct 29 '18 at 15:48
-3
def subsequence(seq, subseq):
    seq = seq.lower()
    subseq = subseq.lower()
    for char in subseq:
        try:      
            seq = seq[seq.index(char)+1:]            
        except:
            return False
    return True
Josip Grggurica
  • 421
  • 4
  • 12
  • 2
    Repeatedly slicing the input sequence is needlessly inefficient; using an iterator (or index) would be much better. – Aran-Fey Oct 29 '18 at 16:00