7

How do I create a function sublist() that takes two lists, list1 and list2, and returns True if list1 is a sublist of list2, and False otherwise. list1 is a sublist of list2 if the numbers in list1 appear in list2 in the same order as they appear in list1, but not necessarily consecutively. For example,

>>> sublist([1, 12, 3],[25, 1, 30, 12, 3, 40])
True

>>> sublist([5, 90, 2],[90, 20, 5, 2, 17])
False
Stefan van den Akker
  • 6,661
  • 7
  • 48
  • 63
user3638865
  • 119
  • 1
  • 2
  • 8
  • Within some complexity requirements? – Ryan Haining May 15 '14 at 01:25
  • 2
    Can the items in the list repeat? – grdvnl May 15 '14 at 01:45
  • I initially thought your question was almost identical to this one: http://stackoverflow.com/questions/3847386/testing-if-a-list-contains-another-list-with-python, but upon examination, it's not. Please consider revising your question to avoid it being closed a as a duplicate and to make it simpler for other users trying to do the same thing to find. – jpmc26 May 15 '14 at 02:06
  • @user3638865, I've noticed you've posted 6 questions -- none of which have been marked as "answered". Choose an answer by clicking the "checkmark" next to the answer you deem fit. – bcorso Jun 01 '14 at 17:13

10 Answers10

8

Here's one way to do it in linear time (and constant space) with an iterator:

def sublist(a, b):
    seq = iter(b)
    try:
        for x in a:
            while next(seq) != x: pass
        else:
            return True
    except StopIteration:
        pass
    return False

Basically it goes through each element of the sublist, and sees if it can find that same element in the part of the complete list it hasn't looked at yet. If it makes it through the entire sublist it means we have a match (hence the else statement on the for loop). If we run out of elements to look at in the complete list, it means we don't have a match.

Edit: I have updated my solution so it works with Python 3. For Python 2.5 and older, next(seq) needs to be replaced with seq.next().

hetman
  • 919
  • 6
  • 16
  • 2
    At least on Python 3, this needs to be `seq.__next__()` - `iter` over a list returns an object which does not have a `next` method, just `__next__`. – Peter DeGlopper May 15 '14 at 02:59
  • See my answer for some not terribly rigorous performance testing showing that this is faster than my answer or the recursive version. I didn't test @sshashank124's but as you pointed out copying lists around will have costs. – Peter DeGlopper May 15 '14 at 03:28
  • Oh, of course - `next(seq)` rather than `seq.__next__()`. Yeah, this is the way to go. – Peter DeGlopper May 15 '14 at 04:48
  • If someone needs to micro-optimize this, I suspect `itertools.dropwhile(x.__ne__, seq)` might be faster than an explicit loop (because it iterates in C), or a `for` loop with a `break` (because it iterates with fewer bytecodes) might be faster than a `while` loop around `next`, especially in CPython 2.x. I doubt it will ever matter for most real code, but if it does, there's two things to test. :) – abarnert Jun 11 '15 at 22:37
5

A very rough solution:

def sublist(a, b):
    if not a:
        return True
    for k in range(len(b)):
        if a[0] == b[k]:
            return sublist(a[1:], b[k+1:])
    return False

print sublist([1, 12, 3], [25, 1, 30, 12, 3, 40]) # True
print sublist([12, 1, 3], [25, 1, 30, 12, 3, 40]) # False

Edit: Speed upgrade

huu
  • 7,032
  • 2
  • 34
  • 49
  • Very rough but I believe the first one on here that's correct for all cases. But ugh that execution time! :O. Still +1, it beats mine for comprehensiveness. – Adam Smith May 15 '14 at 01:41
  • I think at most my solution will read each array once. Correct me if I'm wrong, but that's the fastest this can get right? – huu May 15 '14 at 01:51
  • I want to profile this vs my iterative implementation but this looks both correct and elegant. – Peter DeGlopper May 15 '14 at 02:22
  • Don't work it too hard or you'll hit a recursive depth limit :P – huu May 15 '14 at 02:24
  • Yeah, it's a cool demo, but not very practical for real applications. The big problems here are firstly, that it's creating copies of sublists, which will make it very slow and memory hungry the bigger the input becomes. And secondly, the recursive depth limit you pointed out ;) – hetman May 15 '14 at 02:40
2

Here is a simplified version:

def sublist(a,b):
    try:
        return a[0] in b and sublist(a[1:],b[1+b.index(a[0]):])
    except IndexError:
        return True

>>> print sublist([1, 12, 3],[25, 1, 30, 12, 3, 40])
True

>>> print sublist([5, 90, 2],[90, 20, 5, 2, 17])
False
sshashank124
  • 31,495
  • 9
  • 67
  • 76
2

Here's an iterative solution which should have optimal asymptotics:

def sublist(x, y):
    if x and not y:
        return False
    i, lim = 0, len(y)
    for e in x:
        while e != y[i]:
            i += 1
            if i == lim:
                return False
        i += 1
    return True

@sshashank124's solution has the same complexity, but the dynamics will be somewhat different: his version traverses the second argument multiple times, but because it pushes more work into the C layer it'll probably be much faster on smaller input.

Edit: @hetman's solution has essentially the same logic, but is much more Pythonic, although, contrary to my expectation, it seems to be slightly slower. (I was also incorrect about the performance of @sshashan124's solution; the overhead of the recursive calls appears to outweigh the benefit of doing more work in C.)

Alp
  • 2,766
  • 18
  • 13
  • Are you sure about @sshashank124's solution having the same complexity? It's creating copies of sublists which means the space complexity is certainly going to be higher. – hetman May 15 '14 at 02:34
  • I was thinking only of time complexity. – Alp May 15 '14 at 02:35
2

Congrats on a deceptively hard question. I think this will work but I will not be shocked if I missed a corner case, especially with repeated elements. Revised version inspired by Hgu Nguyen's recursive solution:

def sublist(a, b):
    index_a = 0
    index_b = 0
    len_a = len(a)
    len_b = len(b)
    while index_a < len_a and index_b < len_b:
        if a[index_a] == b[index_b]:
            index_a += 1
            index_b += 1
        else:
            index_b += 1
    return index_a == len_a

Some rough profiling:

Given lists that require traversing most or all of b, my algorithm suffers:

a = [1, 3, 999999]
b = list(range(1000000))

On my PC, Huu Nguyen or Hetman's algorithm takes about 10 seconds to run 100 iterations of the check. My algorithm takes 20 seconds.

Given an earlier success, Huu's algorithm falls vastly behind:

a = [1, 3, 5]

Hetman's algorithm or mine can complete 100k checks in under a second - Hetman's in 0.13 seconds on my PC, mine in 0.19 seconds. Huu's takes 16 seconds to complete 1k checks. I am frankly astounded at that degree of difference - recursion can be slow if not compiler optimized, I know, but 4 orders of magnitude is worse than I would have expected.

Given a failure list a, the performance heads back towards what I saw when requiring traversal of the whole second list - understandable, since there's no way to know that there won't be a sequence at the end that matches the otherwise unmatchable list.

a = [3, 1, 5]

Again, about 10 seconds for Huu Nguyen or Hetman's algorithm for 100 tests, 20 for mine.

Longer ordered lists maintain the pattern I saw for early success. EG:

a = range(0, 1000, 20)

With Hetman's algorithm that took 10.99 seconds to complete 100k tests, while mine took 24.08. Huu's took 28.88 to complete 100 tests.

These are admittedly not the full range of tests you could run, but in all cases Hetman's algorithm performed the best.

Community
  • 1
  • 1
Peter DeGlopper
  • 36,326
  • 7
  • 90
  • 83
  • Excellent analysis. Yeah, I had a feeling mine wasn't quite optimal. It's nice to see a thorough comparison for different cases here. – huu May 15 '14 at 05:13
  • 1
    @HuuNguyen - I can only think that in the early success case, the time spent copying most of `b` is crippling the recursive version. It should only have to do it three times in my simple test case (after matching 1, 3 and 5) but it seems that adds up. You could probably optimize some by passing a starting index k into b and passing an uncopied list reference in to the recursive call, then doing `for k in range(starting_index, len(b)):` - but I expect @Hetman's still beats it. A long list `b` highlights the copying cost, but overall I think it's O(a*b) since you copy parts of b `len(a)` times. – Peter DeGlopper May 15 '14 at 05:50
1

How about this: let's approach this from the other side:

def sublist(a,b):
    """returns True if a is contained in b and in the same order"""
    return a == [ch for ch in b if ch in a]

This will fail in some circumstances (e.g. should [1,2,3] be a subset of [1,1,8,2,3]) but it's hard to say exactly how you want this implemented...

Adam Smith
  • 52,157
  • 12
  • 73
  • 112
1

For a quick-and-dirty solution that runs slowly, but will be totally adequate for arrays of the size you showed:

def sublist(a,b):
    last = 0
    for el_a in a:
        if el_a in b[last:]:
             last = b[last:].index(el_a)
        else:
             return False
    return True

**Edited to work for non-contiguous elements

John Haberstroh
  • 465
  • 4
  • 11
1

Here's another solution that may be easier for novices to understand than Hetman's. (Notice that it's very close to the OP's implementation in this duplicate question, but avoiding the problem of restarting the search from the start of b each time.)

def sublist(a, b):
    i = -1
    try:
        for e in a:
            i = b.index(e, i+1)
    except ValueError:
        return False
    else:
        return True

Of course this requires b to be a list, while Hetman's answer allows any iterable. And I think that (for people who understand Python well enough) it's less simple than Hetman's answer, too.

Algorithmically, it's doing the same thing as Hetman's answer, so it's O(N) time and O(1) space. But practically, it may be faster, at least in CPython, since we're moving the inner loop from a Python while around an iterator to a C fast-getindex loop (inside list.index). Then again, it may be slower, because we're copying around that i value instead of having all state embedded inside a (C-implemented) iterator. If it matters, test them both with your real data. :)

Community
  • 1
  • 1
abarnert
  • 354,177
  • 51
  • 601
  • 671
0

here is a better solution using regex:

import re


def exiSt(short,long):
    r = ''.join(["(.*"+str[x]+")" for x in short])
    return re.match(r,','.join([str(x) for x in long])) == None

long = [1, 2, 3, 4, 5]
short1 = [1,2,5]
short2 = [1,5,3]

exiSt(short1,long)
>> True

exiSt(short2,long)
>> False
farhawa
  • 10,120
  • 16
  • 49
  • 91
0
def sublist(a, b):
    "if set_a is not subset of set_b then obvious answer is False"
    if not set(a).issubset(set(b)):
        return False
    n = 0
    for i in a:
        if i in b[n:]:
            "+1 is needed to skip consecutive duplicates, i.e. sublist([2,1,1],[2,1]) = False"
            "if +1 is absent then sublist([2,1,1],[2,1]) = True"
            "choose to use or not to use +1 according to your needs"
            n += b[n:].index(i) + 1
        else:
            return False
    return True
Trigremm
  • 61
  • 7