I would simply split the list of tuples into N sub-lists, where N is the number of processors on my computer, create a multiprocessing pool of size N and then search the N sub-lists in parallel. I would also initialize each process in the pool with search_list converted to a set, so that testing whether an element of a sub-list is one of the sought elements becomes an O(1) operation (in case search_list in reality is larger than 2 elements). Worker function search_the_tuple then loops through its sub-list, testing each tuple for membership in search_list, and returns the first match it finds (or None if there is none). As soon as the first parallel process returns a non-None result, the remaining processes are terminated.
The question is whether converting each sub-list to a set and performing an intersection, which would be done by library routines, would be significantly faster than looping through the sub-list and testing set membership, which is primarily Python bytecode.
import multiprocessing as mp

def init_pool(s_list):
    global search_list
    search_list = set(s_list)

def search_the_tuple(t_sub_lib):
    for item in t_sub_lib:
        if item in search_list:
            return item

def split(lst, n):
    # a generator expression to split the library of tuples into n (nearly) equal sub-lists:
    k, m = divmod(len(lst), n)
    return (lst[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

def main():
    search_list = [('a','a','1'), ('m', '1', 'l')]
    tuple_library = [
        ('g', 'z', '1'), ('h', '3', 'b'), ('i', 'a', 'l'), ('j', 'z', '1'), ('k', '3', 'b'), ('l', '1', 'l'),
        ('m', 'z', '1'), ('n', '3', 'b'), ('o', '1', 'l'), ('p', 'z', '1'), ('q', '3', 'b'), ('r', '1', 'l'),
        ('a', 'z', '1'), ('r', '3', 'b'), ('m', '1', 'l'), ('d', 'z', '1'), ('e', '3', 'b'), ('f', '1', 'l')
    ]
    n_processors = mp.cpu_count()
    with mp.Pool(n_processors, initializer=init_pool, initargs=(search_list,)) as pool:
        t_sub_libs = split(tuple_library, n_processors)
        for t in pool.imap_unordered(search_the_tuple, t_sub_libs):
            if t is not None:
                print(t)
                break
    # When we exit this block, terminate will be called on pool and
    # all processes will be killed.

# required for Windows:
if __name__ == '__main__':
    main()
Prints:
('m', '1', 'l')
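As a quick check of how the split helper divides the work (the 10-element list and 4 workers below are just illustrative values I chose, not part of the OP's problem):

def split(lst, n):
    # same helper as above: n nearly equal contiguous slices
    k, m = divmod(len(lst), n)
    return (lst[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

print(list(split(list(range(10)), 4)))
# prints [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]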
Update
I did a little benchmark to see whether it was worthwhile to convert a list into a set and then perform an intersection. In this benchmark I have a list of 20,000 elements and, to give the set-conversion approach the advantage, the element I am looking for is at the very end:
import timeit

s = set([(19999, 19999), (21000, 21000)])
lst = [(i,i) for i in range(20000)]

def s1():
    for i in lst:
        if i in s:
            return i

def s2():
    s2 = set(lst)
    return s & s2

print(timeit.timeit('s1()', number=1000, globals=globals()))
print(timeit.timeit('s2()', number=1000, globals=globals()))
Prints:
0.9409163000000262
1.8975496000000476
But even here the set-conversion version, s2, was almost twice as slow. And when the value I am looking for is in the middle of the list, which it would be on average, we have:
import timeit

s = set([(10000, 10000), (21000, 21000)])
lst = [(i,i) for i in range(20000)]

def s1():
    for i in lst:
        if i in s:
            return i

def s2():
    s2 = set(lst)
    return s & s2

print(timeit.timeit('s1()', number=1000, globals=globals()))
print(timeit.timeit('s2()', number=1000, globals=globals()))
Prints:
0.5094996000000265
1.8004526000001988
And, as you would expect, s1 now runs almost twice as fast as it did before (and more than three times as fast as s2).
Conclusions
- You should definitely convert search_list to a set, since it will be searched multiple times. This may be the only optimization you have available.
- You have specified only that tuple_library is very large, which is fairly unspecific. My benchmark suggests that it is not advantageous to convert your tuple_library to a set, since it is searched only once.
- Although I have shown how you would "divide and conquer" using multiprocessing, it is not at all clear to me (again, not knowing how large your tuple_library actually is, and it may not even matter) that multiprocessing will improve performance at all; it could even hurt performance. There is overhead in just creating the process pool and overhead again in passing the sub-lists to processes that "live" in a different address space. You would just have to try it and see whether it helps; a rough way to do that is sketched below.
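Here is a minimal, self-contained sketch for timing the serial search against the pool-based search. It reuses the same helpers shown above; the 2,000,000-tuple library, the single sought tuple and the names serial_search and pooled_search are made up for illustration and should be replaced with your real data:

import time
import multiprocessing as mp

def init_pool(s_list):
    global search_list
    search_list = set(s_list)

def search_the_tuple(t_sub_lib):
    # scan one sub-list, returning the first tuple found in search_list (or None)
    for item in t_sub_lib:
        if item in search_list:
            return item

def split(lst, n):
    # split lst into n nearly equal contiguous sub-lists
    k, m = divmod(len(lst), n)
    return (lst[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

def serial_search(tuple_library, s_list):
    s = set(s_list)
    for t in tuple_library:
        if t in s:
            return t

def pooled_search(tuple_library, s_list):
    n = mp.cpu_count()
    with mp.Pool(n, initializer=init_pool, initargs=(s_list,)) as pool:
        for t in pool.imap_unordered(search_the_tuple, split(tuple_library, n)):
            if t is not None:
                return t  # leaving the with-block terminates the remaining workers

if __name__ == '__main__':
    # made-up data purely so there is something to time
    tuple_library = [(i, i, i) for i in range(2_000_000)]
    s_list = [(1_500_000, 1_500_000, 1_500_000)]
    for f in (serial_search, pooled_search):
        start = time.perf_counter()
        f(tuple_library, s_list)
        print(f.__name__, time.perf_counter() - start)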
Update 2
There is a problem with the benchmarks done by @lllrnr101. I would have offered this up as a comment but, unfortunately, it is too long for one. First, the function normal_search_multiple_tuples_in_tuples_list should be coded as:
def normal_search_multiple_tuples_in_tuples_list(tuples_to_search, big_tuple_list):
    s = set(tuples_to_search)
    for t in big_tuple_list:
        if t in s:
            return t
Note that in the OP's question, the OP is returning after the first match is found and is not attempting to find all matches. It is probably more efficient to at least convert tuples_to_search
to a set since it will be searched multiple times.
Second, the only two strategies that really need to be compared are normal_search_multiple_tuples_in_tuples_list and intersect_search_tuples_in_tuples_list. The later benchmarks, which involve first converting the list to a set and then searching it, do not correctly take the cost of that conversion into account and do not correspond to the actual problem (the time printed for converting the list to a set is the cost of a single conversion, while the other times printed are for 100,000 iterations).
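To put that conversion on the same footing, its cost would have to be paid on every timed call. A small illustrative sketch (the 20,000-element list is an arbitrary size I chose):

import timeit

big_tuple_list = [(i, i, i) for i in range(20000)]

# time just the list-to-set conversion, per call, the same way the
# search functions themselves are timed
print(timeit.timeit('set(big_tuple_list)', number=1000, globals=globals()))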
Third, the OP specifies that the list in question is very large, whatever that means, but I take it to be larger than 10 and 100 and probably 1000. So the benchmark should be done against larger lists with timeit
doing fewer iterations so that it completes in a reasonable amount of time. That leads to the following benchmark:
import timeit
import time
import random
import string
from random import randint
from random import sample

data_values = string.ascii_lowercase + string.digits
len_data_values = len(data_values)
random_lower_limit = 0
random_upper_limit = len_data_values - 1

def normal_search_multiple_tuples_in_tuples_list(tuples_to_search, big_tuple_list):
    s = set(tuples_to_search)
    for t in big_tuple_list:
        if t in s:
            return t

def intersect_search_tuples_in_tuples_list(tuples_to_search, big_tuple_list):
    tuples_to_search_set = set(tuples_to_search)
    big_tuple_set = set(big_tuple_list)
    return tuples_to_search_set.intersection(big_tuple_set)

def get_big_data(big_tuple_list_size):
    big_tuple_list = [(
        data_values[randint(random_lower_limit, random_upper_limit)],
        data_values[randint(random_lower_limit, random_upper_limit)],
        data_values[randint(random_lower_limit, random_upper_limit)]
    ) for _ in range(big_tuple_list_size)]
    return big_tuple_list

def get_small_data(big_tuple_list, tuples_to_search_size):
    return sample(big_tuple_list, tuples_to_search_size)

def get_required_data(big_tuple_list_size, tuples_to_search_size):
    big_tuple_list = get_big_data(big_tuple_list_size)
    tuples_to_search = get_small_data(big_tuple_list, tuples_to_search_size)
    return (big_tuple_list, tuples_to_search)

if __name__ == '__main__':
    test_combinations = [(10000, 5), (20000, 5), (100000, 5)]
    functions_to_test = [
        normal_search_multiple_tuples_in_tuples_list,
        intersect_search_tuples_in_tuples_list
    ]
    for big_tuple_list_size, tuples_to_search_size in test_combinations:
        tuples_to_search, big_tuple_list = get_required_data(big_tuple_list_size, tuples_to_search_size)
        print(f'For a run of searching {tuples_to_search_size} needles in {big_tuple_list_size} size haystack')
        for func in functions_to_test:
            print(f'''Time taken by {func.__name__}: {timeit.timeit('func(tuples_to_search, big_tuple_list)', number=1000, globals=globals())}''')
        print()
Prints:
For a run of searching 5 needles in 10000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 0.681931
Time taken by intersect_search_tuples_in_tuples_list: 0.5806513
For a run of searching 5 needles in 20000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 0.8495027999999998
Time taken by intersect_search_tuples_in_tuples_list: 0.8799418999999999
For a run of searching 5 needles in 100000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 8.51018
Time taken by intersect_search_tuples_in_tuples_list: 8.115366799999999
There isn't a big difference in timings. My own benchmark, where one of the sought items is a middle element of the list, shows that not converting the list considerably out-performed converting it to a set. But the big difference between this benchmark and mine is that my benchmark does not convert the tuples_to_search list to a set inside the timed code; that set is pre-created, since the OP knows in advance what they are looking for. If we remove the cost of creating that set from the benchmark, we have:
import timeit
import time
import random
import string
from random import randint
from random import sample

data_values = string.ascii_lowercase + string.digits
len_data_values = len(data_values)
random_lower_limit = 0
random_upper_limit = len_data_values - 1

def normal_search_multiple_tuples_in_tuples_list(tuples_to_search, big_tuple_list):
    for t in big_tuple_list:
        if t in tuples_to_search:
            return t

def intersect_search_tuples_in_tuples_list(tuples_to_search, big_tuple_list):
    big_tuple_set = set(big_tuple_list)
    return tuples_to_search.intersection(big_tuple_set)

def get_big_data(big_tuple_list_size):
    big_tuple_list = [(
        data_values[randint(random_lower_limit, random_upper_limit)],
        data_values[randint(random_lower_limit, random_upper_limit)],
        data_values[randint(random_lower_limit, random_upper_limit)]
    ) for _ in range(big_tuple_list_size)]
    return big_tuple_list

def get_small_data(big_tuple_list, tuples_to_search_size):
    return sample(big_tuple_list, tuples_to_search_size)

def get_required_data(big_tuple_list_size, tuples_to_search_size):
    big_tuple_list = get_big_data(big_tuple_list_size)
    tuples_to_search = get_small_data(big_tuple_list, tuples_to_search_size)
    return (big_tuple_list, tuples_to_search)

if __name__ == '__main__':
    test_combinations = [(10000, 5), (20000, 5), (100000, 5)]
    functions_to_test = [
        normal_search_multiple_tuples_in_tuples_list,
        intersect_search_tuples_in_tuples_list
    ]
    for big_tuple_list_size, tuples_to_search_size in test_combinations:
        tuples_to_search, big_tuple_list = get_required_data(big_tuple_list_size, tuples_to_search_size)
        print(f'For a run of searching {tuples_to_search_size} needles in {big_tuple_list_size} size haystack')
        s = set(tuples_to_search)
        for func in functions_to_test:
            print(f'''Time taken by {func.__name__}: {timeit.timeit('func(s, big_tuple_list)', number=1000000, globals=globals())}''')
        print()
Prints:
For a run of searching 5 needles in 10000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 0.16306289999999998
Time taken by intersect_search_tuples_in_tuples_list: 0.6508408
For a run of searching 5 needles in 20000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 0.16933260000000006
Time taken by intersect_search_tuples_in_tuples_list: 0.6473307000000001
For a run of searching 5 needles in 100000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 0.1777871999999996
Time taken by intersect_search_tuples_in_tuples_list: 0.7130949000000002
And you can plainly see that you should not convert the list to a set. Finally, if that's not convincing, here is my benchmark where the conversion of the small list of sought tuples to a set is done inside the search functions, and the matching element is the last element of the list being searched. This should give the maximum advantage to the method that uses set intersection:
import timeit

def s1(s, lst):
    s = set(s)
    for i in lst:
        if i in s:
            return i

def s2(s, lst):
    s = set(s)
    s2 = set(lst)
    return s & s2

lst = [(i,i) for i in range(20000)]
s = [(19999, 19999), (21000, 21000)]  # match will be the last element of lst

# look for an element of s in lst:
print(timeit.timeit('s1(s, lst)', number=1000, globals=globals()))
print(timeit.timeit('s2(s, lst)', number=1000, globals=globals()))
Prints:
0.887862600000517
1.8508867000000464