39

Given a list containing a known pattern surrounded by noise, is there an elegant way to get all items that equal the pattern. See below for my crude code.

list_with_noise = [7,2,1,2,3,4,2,1,2,3,4,9,9,1,2,3,4,7,4,3,1,2,3,5]
known_pattern = [1,2,3,4]
res = []


for i in list_with_noise:
    for j in known_pattern:
        if i == j:
            res.append(i)
            continue

print res

we would get 2, 1, 2, 3, 4, 2, 1, 2, 3, 4, 1, 2, 3, 4, 4, 3

bonus: avoid appending i if the full pattern is not present (ie., allow 1,2,3,4 but not 1,2,3)

examples:

find_sublists_in_list([7,2,1,2,3,4,2,1,2,3,4,9,9,1,2,3,4,7,4,3,1,2,3,5],[1,2,3,4])

[1,2,3,4],[1,2,3,4],[1,2,3,4]


find_sublists_in_list([7,2,1,2,3,2,1,2,3,6,9,9,1,2,3,4,7,4,3,1,2,6],[1,2,3,4])

[1,2,3],[1,2,3],[1,2,3]

The lists contain named tuples.

Rich Tier
  • 9,021
  • 10
  • 48
  • 71
  • 10
    Feast your eyes on http://en.wikipedia.org/wiki/String_searching_algorithm – YXD Apr 11 '12 at 13:30
  • 1
    You might want to give some examples of inputs and the corresponding expected outputs. Your question in its present form is not clear. – NPE Apr 11 '12 at 13:33
  • Is this a question or a programming exercise? – Ferdinand Beyer Apr 11 '12 at 13:33
  • 2
    @FerdinandBeyer I don't understand the difference. I currently Have the above code but want something "nicer" – Rich Tier Apr 11 '12 at 13:36
  • 1
    Let me clarify, do you want to get those entries from `list_with_nose` which are in `known_patterns` in the same order? – bereal Apr 11 '12 at 13:38
  • @rikAtee: What about overlapping patterns (say you look for [1, 1] in [1, 1, 1, 1]): what do you want the code to return: 2 or 3 matches? – Eric O. Lebigot Apr 11 '12 at 13:54
  • @rikAtee: Another question: what kind of elements does your list with noise contain? some fast and simple solutions might be available if the elements are simple enough (like in the regular expression solution proposed by Roman Bodnarchuk). – Eric O. Lebigot Apr 11 '12 at 13:55
  • @EOL [1, 1] in [1, 1, 1, 1] doesn't matter in thsi case. the list with noise will not be so simple. The lists contain named tuples, unfortunately. – Rich Tier Apr 11 '12 at 14:02
  • Does this answer your question? [Python/NumPy first occurrence of subarray](https://stackoverflow.com/questions/7100242/python-numpy-first-occurrence-of-subarray) – norok2 Mar 23 '20 at 23:52

7 Answers7

54

I know this question is 5 months old and already "accepted", but googling a very similar problem brought me to this question and all the answers seem to have a couple of rather significant problems, plus I'm bored and want to try my hand at a SO answer, so I'm just going to rattle off what I've found.

The first part of the question, as I understand it, is pretty trivial: just return the original list with all the elements not in the "pattern" filtered out. Following that thinking, the first code I thought of used the filter() function:

def subfinder(mylist, pattern):
    return list(filter(lambda x: x in pattern, mylist))

I would say that this solution is definitely more succinct than the original solution, but it's not any faster, or at least not appreciably, and I try to avoid lambda expressions if there's not a very good reason for using them. In fact, the best solution I could come up with involved a simple list comprehension:

def subfinder(mylist, pattern):
    pattern = set(pattern)
    return [x for x in mylist if x in pattern]

This solution is both more elegant and significantly faster than the original: the comprehension is about 120% faster than the original, while casting the pattern into a set first bumps that up to a whopping 320% faster in my tests.

Now for the bonus: I'll just jump right into it, my solution is as follows:

def subfinder(mylist, pattern):
    matches = []
    for i in range(len(mylist)):
        if mylist[i] == pattern[0] and mylist[i:i+len(pattern)] == pattern:
            matches.append(pattern)
    return matches

This is a variation of Steven Rumbalski's "inefficient one liner", that, with the addition of the "mylist[i] == pattern[0]" check and thanks to python's short-circuit evaluation, is significantly faster than both the original statement and the itertools version (and every other offered solution as far as I can tell) and it even supports overlapping patterns. So there you go.

mintchkin
  • 1,564
  • 2
  • 14
  • 13
  • 1
    nice! I would do for i in xrange( len(mylist) - len(patern) +1 ) though to avoid needless comparisons: e.g., if len(mylist) was 1000, and len(pattern)was 50 then when i = 951 then len(mylist[i+len(pattern)]) will be 49, and len(pattern) would be 50 and so there is no way that can be matched – Rich Tier Sep 28 '12 at 23:49
  • also, list comprehension'd it – Rich Tier Sep 29 '12 at 00:05
  • 2
    I found that the second example also found sub-lists that were out of order. – John McGehee Feb 25 '17 at 20:33
  • 1
    I think I would return a list of indices into mylist where matches were found. Returning copies of the pattern is not helpful since we know what the pattern is. Good solution though. Thanks! – theoden Feb 08 '19 at 23:46
  • Note that this is optimized for relatively short search patterns. At root, this sort of algorithm should be optimized to limit visits per node in the search area. As you increase the length (and/or repetition) of the search string, you will increase the number of repeat visits to nodes. For instance, `[1,2,1,2,3] in [1,2,1,2,1,2,3]` will result in 21 visits, while my overlap solution would only do 13. Where the visit count starts to give worse perf than the optimizations from dropping in to C is is an open question though. – Silas Ray Feb 28 '19 at 20:28
  • Oh, also, note that list slicing negates the short-circuit benefit, since slicing materializes the entire slice in memory, so you have a minimum time complexity here of len_of_search * len_of_target. You could use `itertools.islice()`, but then you'd have to switch out the `==` for something that would work with an iterator. – Silas Ray Mar 01 '19 at 17:14
  • You can optimize it by convert the loop into ```while``` loop instead and increase the index by the length of the pattern if we have a match, else by one. – Sido4odus Dec 16 '21 at 08:48
5

This will get the "bonus" part of your question:

pattern = [1, 2, 3, 4]
search_list = [7,2,1,2,3,4,2,1,2,3,4,9,9,1,2,3,4,7,4,3,1,2,3,5]
cursor = 0
found = []
for i in search_list:
    if i == pattern[cursor]:
        cursor += 1
        if cursor == len(pattern):
            found.append(pattern)
            cursor = 0
    else:
        cursor = 0

For non-bonus:

pattern = [1, 2, 3, 4]
search_list = [7,2,1,2,3,4,2,1,2,3,4,9,9,1,2,3,4,7,4,3,1,2,3,5]
cursor = 0
found = []
for i in search_list:
    if i != pattern[cursor]:
        if cursor > 0:
            found.append(pattern[:cursor])
        cursor = 0
    else:
        cursor += 1

Finally, this one handles overlaps:

def find_matches(pattern_list, search_list):
    cursor_list = []
    found = []
    for element in search_list:
        cursors_to_kill = []
        for cursor_index in range(len(cursor_list)):
            if element == pattern_list[cursor_list[cursor_index]]:
                cursor_list[cursor_index] += 1
                if cursor_list[cursor_index] == len(pattern_list):
                    found.append(pattern_list)
                    cursors_to_kill.append(cursor_index)
            else:
                cursors_to_kill.append(cursor_index)
        cursors_to_kill.reverse()
        for cursor_index in cursors_to_kill:
            cursor_list.pop(cursor_index)
        if element == pattern_list[0]:
            cursor_list.append(1)
    return found
Silas Ray
  • 25,682
  • 5
  • 48
  • 63
  • @rikAtee Note that this solution will return 2, not 3 results in the overlap condition mentioned by EOL in the comments to the question. – Silas Ray Apr 11 '12 at 13:59
  • @rikAtee, no, we aren't appending in real time, we are only appending once we've found a complete match, so we need to append the entire pattern. – Silas Ray Apr 11 '12 at 14:00
  • My answer gets downvoted (without even a comment) while the answer that actually introduces new bugs by unnecessarily and jankily converting the entire list to a string gets 3 upvotes? SO is a weird place sometimes... – Silas Ray Apr 11 '12 at 14:09
  • indeed, but it also got a big green tick next to it! – Rich Tier Apr 11 '12 at 14:13
  • @sr2222: No downvote, but the bonus solution is could be made simpler by simply *counting* the number of occurrences of the pattern. If necessary, the result can then simply be given as `pattern * n_occurrences`. – Eric O. Lebigot Apr 11 '12 at 14:37
  • @EOL, Actually, to match OP's expected list of list results in his edited question, you need append. Doing the list multiply will return you the result in his initially posted question, but I'm going with his clarification. – Silas Ray Apr 11 '12 at 15:18
  • 1
    This is actually a imperfect solution. For instance, if you are searching for `[1, 2, 1, 2, 3]` in `[1, 2, 1, 2, 1, 2, 3]`, it will miss the match. But for any non-repeating match patterns, this solution will work fine. If you want to handle repeating match patterns, the most efficient way to do it involves pre-processing the search string for the repeating sections and being able to have the loop track multiple concurrent overlapping matching subsections. – Silas Ray Apr 11 '12 at 20:47
  • @SilasRay can you explain a bit more what you did on your first program, please? – Andy K Jul 19 '16 at 12:36
3

An iterator-based approach that is still based on the naive algorithm, but tries to do as much implicit looping as possible making use of .index():

def find_pivot(seq, subseq):
    n = len(seq)
    m = len(subseq)
    stop = n - m + 1
    if n > 0:
        item = subseq[0]
        i = 0
        try:
            while i < stop:
                i = seq.index(item, i)
                if seq[i:i + m] == subseq:
                    yield i
                i += 1
        except ValueError:
            return

compared to a couple of others approaches with various degrees of explicit looping:

def find_loop(seq, subseq):
    n = len(seq)
    m = len(subseq)
    for i in range(n - m + 1):
        if all(seq[i + j] == subseq[j] for j in (range(m))):
            yield i
def find_slice(seq, subseq):
    n = len(seq)
    m = len(subseq)
    for i in range(n - m + 1):
        if seq[i:i + m] == subseq:
            yield i
def find_mix(seq, subseq):
    n = len(seq)
    m = len(subseq)
    for i in range(n - m + 1):
        if seq[i] == subseq[0] and seq[i:i + m] == subseq:
            yield i

one would get:

a = list(range(10))
print(a)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

b = list(range(5, 10))
print(b)
# [5, 6, 7, 8, 9]

funcs = find_pivot, find_loop, find_slice, find_mix, 
for func in funcs:
    print()
    print(func.__name__)
    print(list(func(a * 10, b)))
    aa = a * 100
    %timeit list(func(aa, b))
    random.shuffle(aa)
    %timeit list(func(aa, b))

# find_pivot
# [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
# 10000 loops, best of 3: 49.6 µs per loop
# 10000 loops, best of 3: 50.1 µs per loop

# find_loop
# [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
# 1000 loops, best of 3: 712 µs per loop
# 1000 loops, best of 3: 680 µs per loop

# find_slice
# [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
# 10000 loops, best of 3: 162 µs per loop
# 10000 loops, best of 3: 162 µs per loop

# find_mix
# [5, 15, 25, 35, 45, 55, 65, 75, 85, 95]
# 10000 loops, best of 3: 82.2 µs per loop
# 10000 loops, best of 3: 83.9 µs per loop


Note that this is ~30% faster than the currently accepted answer with the tested input.

norok2
  • 25,683
  • 4
  • 73
  • 99
1
list_with_noise = [7,2,1,2,3,4,2,1,2,3,4,9,9,1,2,3,4,7,4,3,1,2,3,5]
string_withNoise = "".join(str(i) for i in list_with_noise)
known_pattern = [1,2,3,4]
string_pattern = "".join(str(i) for i in known_pattern)
string_withNoise.count(string_pattern)
user850498
  • 717
  • 1
  • 9
  • 22
0

Given:

a_list = [7,2,1,2,3,4,2,1,2,3,4,9,9,1,2,3,4,7,4,3,1,2,3,5]
pat = [1,2,3,4]

Here is an inefficient one liner:

res = [pat for i in range(len(a_list)) if a_list[i:i+len(pat)] == pat]

Here is a more efficient itertools version:

from itertools import izip_longest, islice

res = []
i = 0  

while True:
    try:
        i = a_list.index(pat[0], i)
    except ValueError:
        break
    if all(a==b for (a,b) in izip_longest(pat, islice(a_list, i, i+len(pat)))):
        res.append(pat)
        i += len(pat)
    i += 1
Steven Rumbalski
  • 44,786
  • 9
  • 89
  • 119
  • Why not replace `all(a==b…)` simply by `pat == a_list[i:i+len(pat)]`? This would also make your code more robust, as list `a_list` might contain `None` elements that would break the `izip_longest()` test (since it yields `None` for missing elements). – Eric O. Lebigot Apr 12 '12 at 05:46
  • @EOL The idea behind using `all` is to avoid the creation of sublists. The only risk with `None` is if `pat` contains `None`, in which case the `fillvalue` parameter to `izip_longest` can be used. `izip` was not used because to avoid matching a partial pattern at the end of the list. – Steven Rumbalski Apr 12 '12 at 14:30
0

An idiomatic, composable solution to the problem.

First off, we need to borrow an itertools recipe, consume (which consumes and discards a given number of elements from an iterator. Then we take the itertools recipe for pairwise, and extend it to an nwise function using consume:

import itertools

def nwise(iterable, size=2):
    its = itertools.tee(iterable, size)
    for i, it in enumerate(its):
        consume(it, i)  # Discards i elements from it
    return zip(*its)

Now that we have that, solving the bonus problem is really easy:

def find_sublists_in_list(biglist, searchlist):
    searchtup = tuple(searchlist)
    return [list(subtup) for subtup in nwise(biglist, len(searchlist)) if subtup == searchtup]

    # Or for more obscure but faster one-liner:
    return map(list, filter(tuple(searchlist).__eq__, nwise(biglist, len(searchlist))))

Similarly, a more succinct and speedier (if somewhat less pretty) solution to the main problem replaces:

def subfinder(mylist, pattern):
    pattern = set(pattern)
    return [x for x in mylist if x in pattern]

with:

def subfinder(mylist, pattern):
    # Wrap filter call in list() if on Python 3 and you need a list, not a generator
    return filter(set(pattern).__contains__, mylist)

This behaves the same way, but avoids needing to store the temporary set to a name, and pushes all the filtering work to C.

ShadowRanger
  • 143,180
  • 12
  • 188
  • 271
0
def sublist_in_list(sub, lis):
    return str(sub).strip('[]') in str(lis).strip('[]')
norok2
  • 25,683
  • 4
  • 73
  • 99
Evgen
  • 1