
Suppose I have a list, my_huge_list_of_lists, with 2,000,000 lists in it, each about 50 elements long.

I want to shorten my_huge_list_of_lists by discarding sublists that contain two given elements consecutively.

So far I have:

# https://stackoverflow.com/questions/3313590/check-for-presence-of-a-sliced-list-in-python
def check_if_list_is_sublist(lst, sublst):
    #checks if a list appears in order in another larger list.
    n = len(sublst)
    return any((sublst == lst[i:i + n]) for i in range(len(lst) - n + 1))

my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                            if not check_if_list_is_sublist(x, [a,b])]
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                            if not check_if_list_is_sublist(x, [b,a])]

The consecutiveness of the search term [a, b] or [b, a] is important, so I can't use set.issubset().
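To make the adjacency point concrete: both elements can be present in a sublist without ever being consecutive, and issubset() can't tell the difference. A quick illustration:

```python
# set.issubset() ignores order and adjacency, so it can't express
# "a immediately followed by b".
sublist = ['a', 'x', 'b', 'y']
print({'a', 'b'}.issubset(sublist))        # True: both elements present...

# ...but a check over adjacent pairs shows they are never consecutive:
adjacent_pairs = list(zip(sublist, sublist[1:]))
print(('a', 'b') in adjacent_pairs)        # False
print(('b', 'a') in adjacent_pairs)        # False
```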

I find this slow. I'd like to speed it up. I've considered a few options, like an 'early exit' using an and statement:

my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                            if not (a in x and check_if_list_is_sublist(x, [a,b]))]

and fewer passes over the list by combining both checks with an or statement:

my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                            if not (check_if_list_is_sublist(x, [a,b])
                                    or check_if_list_is_sublist(x, [b,a]))]

and I'm also working on speeding up the function itself (WIP):

# https://stackoverflow.com/questions/48232080/the-fastest-way-to-check-if-the-sub-list-exists-on-the-large-list
def check_if_list_is_sublist(lst, sublst):
    # checks if a list appears in order in another larger list (WIP, incomplete)
    set_of_sublists = {tuple(sublist) for sublist in lst}

I've also done some searching on Stack Overflow, but I can't think of a way around the fact that check_if_list_is_sublist() is called len(my_huge_list_of_lists) * 2 times.
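For what the WIP set comprehension seems to be aiming at: with a fixed 2-element search term, each sublist's adjacent pairs can be collected into a set in one pass, so both orderings become O(1) membership tests. A minimal sketch (assumes the list elements are hashable, which holds for default custom objects):

```python
def contains_adjacent_pair(lst, a, b):
    # Collect every adjacent pair in lst in a single pass.
    # Assumes elements are hashable.
    pairs = set(zip(lst, lst[1:]))
    # One call covers both orderings of the search term.
    return (a, b) in pairs or (b, a) in pairs

# Filtering then needs only one call per sublist:
# my_huge_list_of_lists = [x for x in my_huge_list_of_lists
#                          if not contains_adjacent_pair(x, a, b)]
```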

Edit: added some sample data as requested:

from random import randint
from string import ascii_lowercase
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(2000000)]
my_neighbor_search_fwd = ['i', 'c']
my_neighbor_search_rev = my_neighbor_search_fwd[::-1]  # list.reverse() reverses in place and returns None
Joylove
  • What data type are the list elements? – dROOOze Mar 27 '18 at 00:55
  • My idea: create one hash function, `hash(sub_list) for sub_list in large_list`, then save the results into one dict. Compare the hash value for each sublist; if it matches, check whether the whole lists are the same (because two different sublists may have the same hash value). – Sphinx Mar 27 '18 at 00:56
  • Some sample input data would be useful here, so please [edit] your question and add some that folks can use for testing. – martineau Mar 27 '18 at 01:16
  • @dROOOze all are a mix of 2 custom objects of a class that was constructed. – Joylove Mar 27 '18 at 03:36
  • @martineau added some data – Joylove Mar 27 '18 at 03:50

4 Answers


Unpack the items in the n-sized subsequence into n variables. Then write a list comprehension to filter the list, doing a check for a, b or b, a in the sub-list, e.g.

a, b = sublst

def checklst(lst, a, b):
    l = len(lst)
    start = 0
    while True:
        try:
            a_index = lst.index(a, start)
        except ValueError:
            return False
        try:
            return a_index > -1 and lst[a_index+1] == b
        except IndexError:
            try:
                return a_index > -1 and lst[a_index-1] == b
            except IndexError:
                start = a_index + 1
                if start == l:
                    return False
                continue # keep looking at the next a

%timeit found = [l for l in lst if checklst(l, a, b)]
1.88 s ± 31.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit found = [x for x in lst if (a in x and not check_if_list_is_sublist(x, [a,b]))]
22.1 s ± 1.67 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Oluwafemi Sule
  • I see. I'll edit my answer to reflect this. Thanks for your comment. – Oluwafemi Sule Mar 27 '18 at 16:13
  • That's a vast improvement in my opinion, and empirically (according to my benchmark results), it appears to be the fastest—by a substantial amount—of any other answer posted so far (and produces the correct results, as far as I can tell from some limited testing of it I did on my own). – martineau Mar 27 '18 at 21:42
  • This returns True when a, b are found in l, so there's a `not` missing in the timeit. – Joylove Mar 27 '18 at 23:41
  • @Joylove Yes it does. It also returns True when b, a are found in l. I have edited my answer to take into account that there may be other a's in the sequence. – Oluwafemi Sule Mar 28 '18 at 10:01
  • Thank you Oluwafemi. I've implemented your solution and I see a significant improvement. I was wondering, would it be faster again if there was an early-exit criterion based on the presence of b, after the a check? – Joylove Mar 28 '18 at 15:30
  • @Joylove: Couldn't you test that yourself? – martineau Mar 28 '18 at 17:18
  • @martineau yes, I should. And thanks, BTW, for your timing code. Thanks to all who replied, actually; we aren't allowed to reply to everyone individually. – Joylove Mar 28 '18 at 20:03

So, I can't think of any clever algorithmic tricks to really reduce the amount of work here. However, you are doing a LOT of allocations in your code, and iterating too far. So, just moving some declarations out of the function a bit got me:

sublst = [a, b]
l = len(sublst)
indices = range(len(sublst))
def check_if_list_is_sublist(lst):
    for i in range(len(lst) - (l -1)):
        if lst[i] == sublst[0] and lst[i+1] == sublst[1]:
            return True
        if lst[i] == sublst[1] and lst[i + 1] == sublst[0]:
            return True
    return False

my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                           if not check_if_list_is_sublist(x)]

Which reduced the run-time of your sample code above by about 50%. With a list this size, spawning some more processes and dividing the work would probably see a performance increase as well. Can't think of any way to really reduce the number of comparisons, though...
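The "divide the work across processes" idea could be sketched roughly as follows (the chunksize and sample data here are illustrative and unbenchmarked; the worker function must be defined at module level so it can be pickled):

```python
from multiprocessing import Pool

a, b = 'a', 'b'

def keep_sublist(lst):
    # True if lst does NOT contain a,b or b,a consecutively.
    return not any((lst[i] == a and lst[i + 1] == b) or
                   (lst[i] == b and lst[i + 1] == a)
                   for i in range(len(lst) - 1))

if __name__ == '__main__':
    data = [['a', 'b', 'c'], ['c', 'd', 'e'], ['x', 'b', 'a']]
    with Pool() as pool:  # one worker per CPU by default
        mask = pool.map(keep_sublist, data, chunksize=1000)
    filtered = [lst for lst, keep in zip(data, mask) if keep]
    print(filtered)
```

Whether this pays off depends on how the per-sublist work compares to the cost of pickling 2,000,000 sublists across process boundaries.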

Paul Becotte

For searching within one large list, I believe hashing the elements and building an index is a good solution.

The benefit: you build the index once and save time on every future search (no need to loop over the whole list again for each search). You can even build the index when the program launches, then release it when the program exits.

The code below uses two ways to get a hash value: hash() and str(); for some scenarios you may need a custom hash function.

With str(), the code is simpler and you don't need to consider hash collisions, but memory use can blow up.

With hash(), I use a list to save all sublists that share the same hash value, and you can use hash(sub_lst) % designed_length to cap the table size (at the cost of a higher collision rate).

Output of the code below:

By Hash: 0.00023986603994852955
By str(): 0.00022884208565612796
By OP's: 0.3001317172469765
[Finished in 1.781s]

Test Codes:

from random import randint
from string import ascii_lowercase
import timeit

#Generate Test Data
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(10000)]
#print(my_huge_list_of_lists)
test_lst = [['a', 'b', 'c' ], ['a', 'b', 'c'] ]
#Solution 1: By using built-in hash function
def prepare1(huge_list, interval=1): #use built-in hash function
    hash_db = {}
    for index in range(len(huge_list) - interval + 1):
        hash_sub = hash(str(huge_list[index:index+interval]))
        if hash_sub in hash_db:
            hash_db[hash_sub].append(huge_list[index:index+interval])
        else:
            hash_db[hash_sub] = [huge_list[index:index+interval]]
    return hash_db

hash_db = prepare1(my_huge_list_of_lists, interval=2)
def check_sublist1(hash_db, sublst): #use built-in hash function
    hash_sub = hash(str(sublst))
    if hash_sub in hash_db:
        return any([sublst == item for item in hash_db[hash_sub]])
    return False

print('By Hash:', timeit.timeit("check_sublist1(hash_db, test_lst)", setup="from __main__ import check_sublist1, my_huge_list_of_lists, test_lst, hash_db ", number=100))

#Solution 2: By using str() as hash function
def prepare2(huge_list, interval=1): #use str() as hash function
    return { str(huge_list[index:index+interval]):huge_list[index:index+interval] for index in range(len(huge_list) - interval + 1)}

hash_db = prepare2(my_huge_list_of_lists, interval=2)
def check_sublist2(hash_db, sublst): #use str() as hash function
    hash_sub = str(sublst)
    if hash_sub in hash_db:
        return sublst == hash_db[hash_sub]
    return False

print('By str():', timeit.timeit("check_sublist2(hash_db, test_lst)", setup="from __main__ import check_sublist2, my_huge_list_of_lists, test_lst, hash_db ", number=100))

#Solution 3: OP's current solution
def check_if_list_is_sublist(lst, sublst):
    #checks if a list appears in order in another larger list.
    n = len(sublst)
    return any((sublst == lst[i:i + n]) for i in range(len(lst) - n + 1))

print('By OP\'s:', timeit.timeit("check_if_list_is_sublist(my_huge_list_of_lists, test_lst)", setup="from __main__ import check_if_list_is_sublist, my_huge_list_of_lists, test_lst ", number=100))

If you'd like to remove the matched elements from the list, that is doable, but the catch is that you may have to rebuild the index for the new list, unless the list is a linked list and the index stores a pointer to each element. I googled how to get a pointer to one element of a Python list but couldn't find anything helpful. If someone knows how to do this, please don't hesitate to share your solution. Thanks.

Below is a sample (it generates a new list instead of mutating the original one; sometimes we still need to filter something out of the original list):

from random import randint
from string import ascii_lowercase
import timeit

#Generate Test Data
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 1)] for x in range(2)] for y in range(100)]
#print(my_huge_list_of_lists)
test_lst = [[['a', 'b'], ['a', 'b'] ], [['b', 'a'], ['a', 'b']]]
#Solution 1: By using built-in hash function
def prepare(huge_list, interval=1): #use built-in hash function
    hash_db = {}
    for index in range(len(huge_list) - interval + 1):
        hash_sub = hash(str(huge_list[index:index+interval]))
        if hash_sub in hash_db:
            hash_db[hash_sub].append({'beg':index, 'end':index+interval, 'data':huge_list[index:index+interval]})
        else:
            hash_db[hash_sub] = [{'beg':index, 'end':index+interval, 'data':huge_list[index:index+interval]}]
    return hash_db

hash_db = prepare(my_huge_list_of_lists, interval=2)

def check_sublist(hash_db, sublst): #use built-in hash function
    hash_sub = hash(str(sublst))
    if hash_sub in hash_db:
        return [ item for item in hash_db[hash_sub] if sublst == item['data'] ]
    return []

def remove_if_match_sublist(target_list, hash_db, sublsts):
    matches = []
    for sublst in sublsts:
        matches += check_sublist(hash_db, sublst)
    #make sure delete elements from end to begin
    sorted_match = sorted(matches, key=lambda item:item['beg'], reverse=True)
    new_list = list(target_list)
    for item in sorted_match:
        del new_list[item['beg']:item['end']]
    return new_list

print('Removed By Hash:', timeit.timeit("remove_if_match_sublist(my_huge_list_of_lists, hash_db, test_lst)", setup="from __main__ import check_sublist, my_huge_list_of_lists, test_lst, hash_db, remove_if_match_sublist ", number=1))
Sphinx
  • What is the purpose of `test_lst`? It looks like it's only being printed, but isn't actually used in either approach. Also, I don't think code that only works with string elements is what the OP needs, because in a comment under their question they say the elements of the lists are all instances of some custom class (i.e. not strings). – martineau Mar 27 '18 at 19:05
  • `test_lst` is one test case, used as the input for sublst when calling `check_if_list_is_sublist`. Even if the elements of the lists are instances of some custom class, it doesn't matter; it may increase hash collisions, but it doesn't affect the result. We can even customize a hash function for this purpose. – Sphinx Mar 27 '18 at 19:10
  • OK, thanks for the clarifications—I'll just take your word that it can easily be customized for lists of arbitrary objects. Regardless, for a list-of-lists of single-character strings it doesn't really seem to be much of a contender (based on the benchmark results I posted). – martineau Mar 27 '18 at 20:18
  • @martineau, the result is slow because building the index takes some time, but as I explained, once it's built you don't need to loop over the list again. For example, we can take 10 secs to build the index when launching the program; then when the user wants to search for one sub_lst, it will take less than 10 ms. Other solutions will take more than 10 ms (even worse if the large list is very huge) because they loop over the whole list for each search, while the hash solution scans the whole list only once. That is why my `timeit` only executes `check_sublist()`. – Sphinx Mar 27 '18 at 20:26
  • Your function only returns a boolean, not a list with the unacceptable sublists removed, so it needs to be called multiple times. The OP wants to remove all sublists from the list-of-lists that contain the target sequence, which is why I think any `check_sublist_xxx()` type of function would have to be applied to each one of them, as is being done in the code in their question (which was taken from @Nas Banov's answer to a linked question). The results from the usage of `timeit.timeit()` in your answer don't do this, so they're fairly meaningless as far as I can tell. – martineau Mar 27 '18 at 21:32
  • Please [edit] your answer and incorporate that into it — then I'll update my benchmark code and results accordingly. – martineau Mar 27 '18 at 21:45
  • @martineau, thanks for the patience. I wasn't aware the OP wants to remove sublists... I've added code to show how to delete the elements based on the hash index, but it has some restrictions. – Sphinx Mar 27 '18 at 22:30
  • If I were to continue removing more sublists using new search pairs, would I rebuild the hash each time my_huge_list_of_lists gets smaller? How would I adapt it in that case? – Joylove Mar 27 '18 at 23:48
  • Sphinx: Your latest revision—with the added `remove_if_match_sublist()` function—doesn't seem to work when I test it. Not sure why, but unfortunately I won't be spending any more time on this question or your answer today, so you might want to look into that issue yourself. – martineau Mar 27 '18 at 23:53
  • @martineau, probably caused by the last parameter of `remove_if_match_sublist`; its last parameter is `[sub_lst]`, not `sub_lst`. – Sphinx Mar 28 '18 at 00:36
  • @Joylove, like this: `new_list = remove_if_match_sublist(my_huge_list_of_lists, prepare(my_huge_list_of_lists, interval=2), [sub_lst0, sub_lst1, sub_lst2]) new_list = remove_if_match_sublist(new_list, prepare(new_list, interval=2), [sub_lst3, sub_lst4]) new_list = remove_if_match_sublist(new_list, prepare(new_list, interval=2), [sub_lst5, sub_lst6])` – Sphinx Mar 28 '18 at 00:40
  • @Joylove, but I don't think it is a good idea, because building the index may be slower than a simple loop. That is why I designed remove_if_match_sublist to support multiple sub_lsts: it helps avoid rebuilding the index when filtering/removing multiple sub_lsts from the same huge list. – Sphinx Mar 28 '18 at 00:44
  • @Joylove, if you only query, I prefer the hash solution; if you need to support **remove**, I don't recommend it unless your list is a linked list (heap, not stack) and Python supports pointers into heap memory. – Sphinx Mar 28 '18 at 00:46

This isn't what you'd call an "answer" per se; rather, it's a benchmarking framework that should help you determine the quickest way to accomplish what you want, because it allows relatively easy modification as well as the addition of different approaches.

I've put the answers currently posted into it, as well as the results of running it with them.

Caveats: Note that I haven't verified that all the tested answers in it are "correct" in the sense that they actually do what you want, nor how much memory they'll consume in the process—which might be another consideration.

Currently it appears that @Oluwafemi Sule's answer is the fastest, by an order of magnitude (over 10x) compared to the closest competitor.

from __future__ import print_function
from collections import namedtuple
import sys
from textwrap import dedent
import timeit
import traceback

N = 10  # Number of executions of each "algorithm".
R = 3  # Number of repetitions of those N executions.

from random import randint, randrange, seed
from string import ascii_lowercase

a, b = 'a', 'b'
NUM_SUBLISTS = 1000
SUBLIST_LEN = 50
PERCENTAGE = 50  # Percentage of sublist that should get removed.

seed(42)  # Initialize random number so the results are reproducible.
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for __ in range(SUBLIST_LEN)]
                                for __ in range(NUM_SUBLISTS)]

# Put the target sequence in percentage of the sublists so they'll be removed.
for __ in range(NUM_SUBLISTS*PERCENTAGE // 100):
    list_index = randrange(NUM_SUBLISTS)
    sublist_index = randrange(SUBLIST_LEN)
    my_huge_list_of_lists[list_index][sublist_index:sublist_index+2] = [a, b]

# Common setup for all testcases (executed before any algorithm specific setup).
COMMON_SETUP = dedent("""
    from __main__ import a, b, my_huge_list_of_lists, NUM_SUBLISTS, SUBLIST_LEN, PERCENTAGE
""")

class TestCase(namedtuple('CodeFragments', ['setup', 'test'])):
    """ A test case is composed of separate setup and test code fragments. """
    def __new__(cls, setup, test):
        """ Dedent code fragment in each string argument. """
        return tuple.__new__(cls, (dedent(setup), dedent(test)))

testcases = {
    "OP (Nas Banov)": TestCase("""
        # https://stackoverflow.com/questions/3313590/check-for-presence-of-a-sliced-list-in-python
        def check_if_list_is_sublist(lst, sublst):
            ''' Checks if a list appears in order in another larger list. '''
            n = len(sublst)
            return any((sublst == lst[i:i+n]) for i in range(len(lst) - n + 1))
        """, """
        shortened = [x for x in my_huge_list_of_lists
                        if not check_if_list_is_sublist(x, [a, b])]
        """
    ),
    "Sphinx Solution 1 (hash)": TestCase("""
        # https://stackoverflow.com/a/49518843/355230

        # Solution 1: By using built-in hash function.
        def prepare1(huge_list, interval=1): # Use built-in hash function.
            hash_db = {}
            for index in range(len(huge_list) - interval + 1):
                hash_sub = hash(str(huge_list[index:index+interval]))
                if hash_sub in hash_db:
                    hash_db[hash_sub].append(huge_list[index:index+interval])
                else:
                    hash_db[hash_sub] = [huge_list[index:index+interval]]
            return hash_db

        def check_sublist1(hash_db, sublst): # Use built-in hash function.
            hash_sub = hash(str(sublst))
            if hash_sub in hash_db:
                return any([sublst == item for item in hash_db[hash_sub]])
            return False
        """, """
        hash_db = prepare1(my_huge_list_of_lists, interval=2)
        shortened = [x for x in my_huge_list_of_lists
                        if check_sublist1(hash_db, x)]
        """
    ),
    "Sphinx Solution 2 (str)": TestCase("""
        # https://stackoverflow.com/a/49518843/355230

        #Solution 2: By using str() as hash function
        def prepare2(huge_list, interval=1): # Use str() as hash function.
            return {str(huge_list[index:index+interval]):huge_list[index:index+interval]
                        for index in range(len(huge_list) - interval + 1)}


        def check_sublist2(hash_db, sublst): #use str() as hash function
            hash_sub = str(sublst)
            if hash_sub in hash_db:
                return sublst == hash_db[hash_sub]
            return False
        """, """
        hash_db = prepare2(my_huge_list_of_lists, interval=2)
        shortened = [x for x in my_huge_list_of_lists
                        if check_sublist2(hash_db, x)]
        """
    ),
    "Paul Becotte": TestCase("""
        # https://stackoverflow.com/a/49504792/355230
        sublst = [a, b]
        l = len(sublst)
        indices = range(len(sublst))

        def check_if_list_is_sublist(lst):
            for i in range(len(lst) - (l -1)):
                if lst[i] == sublst[0] and lst[i+1] == sublst[1]:
                    return True
                if lst[i] == sublst[1] and lst[i + 1] == sublst[0]:
                    return True
            return False
        """, """
        shortened = [x for x in my_huge_list_of_lists
                        if not check_if_list_is_sublist(x)]
        """
    ),
    "Oluwafemi Sule": TestCase("""
        # https://stackoverflow.com/a/49504440/355230
        def checklst(lst, a, b):
            try:
                a_index = lst.index(a)
            except ValueError:
                return False
            try:
                return a_index > -1 and lst[a_index+1] == b
            except IndexError:
                try:
                    return a_index > -1 and lst[a_index-1] == b
                except IndexError:
                    return False
        """, """
        shortened = [x for x in my_huge_list_of_lists
                        if not checklst(x, a, b)]
        """
    ),
}

# Collect timing results of executing each testcase multiple times.
try:
    results = [
        (label,
         min(timeit.repeat(testcases[label].test,
                           setup=COMMON_SETUP + testcases[label].setup,
                           repeat=R, number=N)),
        ) for label in testcases
    ]
except Exception:
    traceback.print_exc(file=sys.stdout)  # direct output to stdout
    sys.exit(1)

# Display results.
print('Results for {:,d} sublists of length {:,d} with {}% of them matching.'
        .format(NUM_SUBLISTS, SUBLIST_LEN, PERCENTAGE))
major, minor, micro = sys.version_info[:3]
print('Fastest to slowest execution speeds using Python {}.{}.{}\n'
      '({:,d} executions, best of {:d} repetitions)'.format(major, minor, micro, N, R))
print()

longest = max(len(result[0]) for result in results)  # length of longest label
ranked = sorted(results, key=lambda t: t[1]) # ascending sort by execution time
fastest = ranked[0][1]
for result in ranked:
    print('{:>{width}} : {:9.6f} secs, rel speed {:5.2f}x, {:6.2f}% slower '
          ''.format(
                result[0], result[1], round(result[1]/fastest, 2),
                round((result[1]/fastest - 1) * 100, 2),
                width=longest))
print()

Output:

Results for 1,000 sublists of length 50 with 50% of them matching.
Fastest to slowest execution speeds using Python 3.6.4
(10 executions, best of 3 repetitions)

          Oluwafemi Sule :  0.006441 secs, rel speed  1.00x,   0.00% slower
            Paul Becotte :  0.069462 secs, rel speed 10.78x, 978.49% slower
          OP (Nas Banov) :  0.082758 secs, rel speed 12.85x, 1184.92% slower
 Sphinx Solution 2 (str) :  0.152119 secs, rel speed 23.62x, 2261.84% slower
Sphinx Solution 1 (hash) :  0.154562 secs, rel speed 24.00x, 2299.77% slower
martineau