8

I want to check if a sublist is present in another (larger) list, in the exact same order of elements. I also want it to allow wildcards. For example I have the following lists:

>>> my_lists
[[0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 2, 2],
[1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 0, 0, 1, 1, 1, 1],
[1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 1, 0, 0, 1, 1, 1, 1],
[1, 1, 1, 1, 0, 2, 1, 2, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 1, 1, 2, 1, 2, 2, 1, 1],
[0, 1, 1, 1, 1, 2, 2, 2, 1, 0, 0, 0, 2, 2, 1, 1, 0, 0, 1, 1, 0],
[1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 0],
[0, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 1, 1, 0, 0, 0],
[0, 0, 1, 1, 1, 1, 1, 2, 1, 1, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 1, 0, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1],
[0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]

And the sublist: [0, 0, 0, 1]. If I want to find which lists contain this exact sublist I can do (taken from here):

def my_func(_list, sub_list):
    n = len(sub_list)
    return any((sub_list== _list[i:i+n]) for i in range(len(_list)-n+1))

for l in my_lists:
    if my_func(l, [0, 0, 0, 1]):
        print(l)

... which basically makes all possible sublists of the same length as the sub_list, and checks whether or not any are equal. And I would get the following output since these lists contain [0, 0, 0, 1]:

[1, 1, 1, 1, 0, 2, 1, 2, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 1, 0]
[0, 0, 0, 0, 1, 1, 2, 1, 2, 2, 1, 1]
[1, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 0]
[0, 0, 0, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Now I also want to add wildcards, meaning that I can give the sublist wildcard elements. For example, now I want to find the sublist [*, *, 0, 0, 0, 1, *]. The asterisks here mean that for those elements, the value could be anything in the list. But for those asterisks there must be a value. The sublist [*, *, 0, 0, 0, 1, *] would now output:

[1, 1, 1, 1, 0, 2, 1, 2, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 1, 0]
[1, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 0]

Note that now [0, 0, 0, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] is not included since this list doesn't have two values before the [0, 0, 0, 1] sequence starts. The same goes for [0, 0, 0, 0, 1, 1, 2, 1, 2, 2, 1, 1], which also doesn't have two values before the sequence. Note that the asterisk could be anything such as np.nan.

How would I extend above code to allow for the wildcards?

sandertjuh
  • 550
  • 2
  • 13

5 Answers5

2

If we create a SuperInt class that allows us to wrap int but make it equal to another instance with a same value (or a 'normal' int with the same value) and the string '*', we can use the same code you already have.

WILDCARD = '*'

class SuperInt(int):
    def __eq__(self, other):
        if not isinstance(other, self.__class__) and other == WILDCARD:
        # or
        # if isinstance(other, str) and other == '*': but there might be a caveat with that
            return True
        return super().__eq__(other)

Converting your my_lists to use SuperInt instances:

for i, li in enumerate(my_lists):
    my_lists[i] = list(map(SuperInt, li))

running the exact same code you already have (just replacing * with WILDCARD as defined above):

def my_func(_list, sub_list):
    n = len(sub_list)
    return any((sub_list == _list[i:i+n]) for i in range(len(_list)-n+1))

for l in my_lists:
    if my_func(l, [WILDCARD, WILDCARD, 0, 0, 0, 1, WILDCARD]):
        print(l)

Outputs

[1, 1, 1, 1, 0, 2, 1, 2, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 1, 0]
[1, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 0]
DeepSpace
  • 78,697
  • 11
  • 109
  • 154
2

Another solution, with custom compare function:

def custom_cmp(l1, l2):
    if len(l1) != len(l2):
        return False

    for a, b in zip(l1, l2):
        if a == "*":  # you can check here for b=='*' if you wish
            continue
        if a != b:
            return False

    return True


def my_func(_list, sub_list):
    n = len(sub_list)
    return any(
        custom_cmp(sub_list, _list[i : i + n])
        for i in range(len(_list) - n + 1)
    )


for l in my_lists:
    if my_func(l, ["*", "*", 0, 0, 0, 1, "*"]):
        print(l)

Prints:

[1, 1, 1, 1, 0, 2, 1, 2, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 1, 0]
[1, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 0]
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
2

One way is to use all when checking against sublists and an if that skips asterisks:

def my_func(a_list, sub_list):
    n = len(sub_list)

    # list-level comparison is now via element-wise
    return any(all(sub_item == chunk_item
                   for sub_item, chunk_item in zip(sub_list, a_list[i:i+n])
                   if not np.isnan(sub_item))  # "is_not_asterisk" condition
               for i in range(len(a_list)-n+1))

where I used not np.isnan(...) as the asterisk condition as mentioned in the question; but it could be many things: e.g., if asterisk is literally "*" in sublists, then the condition there is changed to if sub_item != "*".

sample with np.nan as asterisk:

for a_list in my_lists:
    if my_func(a_list, [np.nan, np.nan, 0, 0, 0, 1, np.nan]):
        print(a_list)

gives

[1, 1, 1, 1, 0, 2, 1, 2, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 1, 0]
[1, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 0]

all returns True if the iterable is empty so if an all-asterisk sub-list is passed, it will return True for all candidates (as long as their length permits, which any will handle because any with empty iterable is False!).

Mustafa Aydın
  • 17,645
  • 4
  • 15
  • 38
0

If the elements in your lists are always single character (such as numbers in your example), you can convert to string and search your motif using the re module. This is more than 10 times faster than @Mustafa Aydın answer, and in my opinion quite fast to write as a one liner. But of course it doesn't work on datasets with multi-character elements.

import re
[l for l in my_lists if re.search('..0001.', ''.join(map(str, l)))]

Alternatively, using filter:

list(filter(lambda l: re.search('..0001.', ''.join(map(str, l))), my_lists))

If you need to input your query as a list with the proposed format for wildcards:

pattern = ['*', '*', 0, 0, 0, 1, '*']
[l
 for l in my_lists
 if re.search(''.join(map(str, pattern)).replace('*', '.'),
              ''.join(map(str, l)))
]
mozway
  • 194,879
  • 13
  • 39
  • 75
0

The question has been answered, but here's another option that I like to think of as 'old school.'

my_lists = [
    [0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 2, 2],
    [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 0, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 2, 1, 1, 0, 0, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 2, 1, 2, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 1, 2, 1, 2, 2, 1, 1],
    [0, 1, 1, 1, 1, 2, 2, 2, 1, 0, 0, 0, 2, 2, 1, 1, 0, 0, 1, 1, 0],
    [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 1, 2, 1, 1, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
]

my_sub_list = [0, 0, 0, '*', 1]

def my_function(_list, _sub):
    a = len(_list)
    b = len(_sub)
    x = 0                                              # sublist index
    y = 0                                              # list index
    while y < a:
        if (_sub[x] == _list[y]) or (_sub[x] == '*'):  # we have a match
            x += 1                                     # increment sublist index counter
            y += 1                                     # increment list index counter
            if x == b:                                 # if sublist index equals sublist length, then we've found a match
                print(_list)
                return
        else:
            if x > 0:                                   # We had a partial match
                y -= x - 1                              # Resetting index counter to next index after previous match
            else:
                y += 1                                  # No match, so we're moving to the next index
            x = 0                                       # Resetting sublist index to the beginning


for l in my_lists:
    my_function(l, my_sub_list)
VikingGlen
  • 1,705
  • 18
  • 18