I would simply split the list of tuples into N sub-lists, where N is the number of processors on my computer, create a multiprocessing pool of size N and then search the N sub-lists in parallel. I would also initialize each process in the pool with search_list converted to a set, so that testing whether an element of a sub-list is one of the sought elements becomes an O(1) operation (in case search_list in reality is larger than 2 elements). Worker function search_the_tuple then loops through its sub-list, testing each tuple for membership in search_list, and returns the first match it finds (or None if there is none). As soon as the first parallel process returns a non-None result, the remaining processes are terminated.
The question is whether converting each sub-list to a set and performing an intersection, which would be done by library routines, would be significantly faster than looping through the sub-list and testing set membership, which is primarily Python bytecode.
import multiprocessing as mp

def init_pool(s_list):
    global search_list
    search_list = set(s_list)

def search_the_tuple(t_sub_lib):
    for item in t_sub_lib:
        if item in search_list:
            return item

def split(lst, n):
    # a generator expression to split the library of tuples into n (nearly) equal sub-lists:
    k, m = divmod(len(lst), n)
    return (lst[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

def main():
    search_list = [('a','a','1'), ('m', '1', 'l')]
    tuple_library = [
        ('g', 'z', '1'), ('h', '3', 'b'), ('i', 'a', 'l'), ('j', 'z', '1'), ('k', '3', 'b'), ('l', '1', 'l'),
        ('m', 'z', '1'), ('n', '3', 'b'), ('o', '1', 'l'), ('p', 'z', '1'), ('q', '3', 'b'), ('r', '1', 'l'),
        ('a', 'z', '1'), ('r', '3', 'b'), ('m', '1', 'l'), ('d', 'z', '1'), ('e', '3', 'b'), ('f', '1', 'l')
    ]
    n_processors = mp.cpu_count()
    with mp.Pool(n_processors, initializer=init_pool, initargs=(search_list,)) as pool:
        t_sub_libs = split(tuple_library, n_processors)
        for t in pool.imap_unordered(search_the_tuple, t_sub_libs):
            if t is not None:
                print(t)
                break
    # When we exit this block, terminate will be called on pool and
    # all processes will be killed.

# required for Windows:
if __name__ == '__main__':
    main()
Prints:
('m', '1', 'l')
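As a quick check of how the split helper divides the work (the 10-element list and 4 workers below are just illustrative values I chose, not part of the OP's problem):

def split(lst, n):
    # same helper as above: n nearly equal contiguous slices
    k, m = divmod(len(lst), n)
    return (lst[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

print(list(split(list(range(10)), 4)))
# prints [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]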
Update
I did a little benchmark to see whether it was worthwhile to convert a list into a set and then perform an intersection. In this benchmark I have a list of 20,000 elements and, to give the set-conversion approach the advantage, the element I am looking for is at the very end:
import timeit

s = set([(19999, 19999), (21000, 21000)])
lst = [(i,i) for i in range(20000)]

def s1():
    for i in lst:
        if i in s:
            return i

def s2():
    s2 = set(lst)
    return s & s2

print(timeit.timeit('s1()', number=1000, globals=globals()))
print(timeit.timeit('s2()', number=1000, globals=globals()))
Prints:
0.9409163000000262
1.8975496000000476
But even here the set-conversion version, s2, was almost twice as slow. And when the value I am looking for is in the middle of the list, which it would be on average, we have:
import timeit

s = set([(10000, 10000), (21000, 21000)])
lst = [(i,i) for i in range(20000)]

def s1():
    for i in lst:
        if i in s:
            return i

def s2():
    s2 = set(lst)
    return s & s2

print(timeit.timeit('s1()', number=1000, globals=globals()))
print(timeit.timeit('s2()', number=1000, globals=globals()))
Prints:
0.5094996000000265
1.8004526000001988
And, as you would expect, s1 now runs almost twice as fast as it did before (and more than three times as fast as s2).
Conclusions
- You should definitely convert search_list to a set, since it will be searched multiple times. This may be the only optimization you have available.
- You have specified only that tuple_library is very large, which is fairly unspecific. My benchmark suggests that it is not advantageous to convert your tuple_library to a set, since it is searched only once.
- Although I have shown how you would "divide and conquer" using multiprocessing, it is not at all clear to me (again, not knowing how large your tuple_library actually is, and it may not even matter) that multiprocessing will improve performance at all; it could even hurt performance. There is overhead in just creating the process pool and overhead again in passing the sub-lists to processes that "live" in a different address space. You would just have to try it and see whether it helps; a rough way to do that is sketched below.
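Here is a minimal, self-contained sketch for timing the serial search against the pool-based search. It reuses the same helpers shown above; the 2,000,000-tuple library, the single sought tuple and the names serial_search and pooled_search are made up for illustration and should be replaced with your real data:

import time
import multiprocessing as mp

def init_pool(s_list):
    global search_list
    search_list = set(s_list)

def search_the_tuple(t_sub_lib):
    # scan one sub-list, returning the first tuple found in search_list (or None)
    for item in t_sub_lib:
        if item in search_list:
            return item

def split(lst, n):
    # split lst into n nearly equal contiguous sub-lists
    k, m = divmod(len(lst), n)
    return (lst[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

def serial_search(tuple_library, s_list):
    s = set(s_list)
    for t in tuple_library:
        if t in s:
            return t

def pooled_search(tuple_library, s_list):
    n = mp.cpu_count()
    with mp.Pool(n, initializer=init_pool, initargs=(s_list,)) as pool:
        for t in pool.imap_unordered(search_the_tuple, split(tuple_library, n)):
            if t is not None:
                return t  # leaving the with-block terminates the remaining workers

if __name__ == '__main__':
    # made-up data purely so there is something to time
    tuple_library = [(i, i, i) for i in range(2_000_000)]
    s_list = [(1_500_000, 1_500_000, 1_500_000)]
    for f in (serial_search, pooled_search):
        start = time.perf_counter()
        f(tuple_library, s_list)
        print(f.__name__, time.perf_counter() - start)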
Update 2
There is a problem with the benchmarks done by @lllrnr101. I would have offered this up as a comment but, unfortunately, it is too long for one. First, the function normal_search_multiple_tuples_in_tuples_list should be coded as:
def normal_search_multiple_tuples_in_tuples_list(tuples_to_search, big_tuple_list):
    s = set(tuples_to_search)
    for t in big_tuple_list:
        if t in s:
            return t
Note that in the OP's question, the OP is returning after the first match is found and is not attempting to find all matches. It is probably more efficient to at least convert tuples_to_search
to a set since it will be searched multiple times.
Second, the only two strategies that really need to be compared are normal_search_multiple_tuples_in_tuples_list and intersect_search_tuples_in_tuples_list. The later benchmarks, which involve first converting the list to a set and then searching it, do not correctly take the cost of that conversion into account and do not correspond to the actual problem (the time printed for converting the list to a set is the cost of a single conversion, while the other times printed are for 100,000 iterations).
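To put that conversion on the same footing, its cost would have to be paid on every timed call. A small illustrative sketch (the 20,000-element list is an arbitrary size I chose):

import timeit

big_tuple_list = [(i, i, i) for i in range(20000)]

# time just the list-to-set conversion, per call, the same way the
# search functions themselves are timed
print(timeit.timeit('set(big_tuple_list)', number=1000, globals=globals()))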
Third, the OP specifies that the list in question is very large, whatever that means, but I take it to be larger than 10 and 100 and probably 1000. So the benchmark should be done against larger lists with timeit
doing fewer iterations so that it completes in a reasonable amount of time. That leads to the following benchmark:
import timeit
import time
import random
import string
from random import randint
from random import sample

data_values = string.ascii_lowercase + string.digits
len_data_values = len(data_values)
random_lower_limit = 0
random_upper_limit = len_data_values - 1

def normal_search_multiple_tuples_in_tuples_list(tuples_to_search, big_tuple_list):
    s = set(tuples_to_search)
    for t in big_tuple_list:
        if t in s:
            return t

def intersect_search_tuples_in_tuples_list(tuples_to_search, big_tuple_list):
    tuples_to_search_set = set(tuples_to_search)
    big_tuple_set = set(big_tuple_list)
    return tuples_to_search_set.intersection(big_tuple_set)

def get_big_data(big_tuple_list_size):
    big_tuple_list = [(
        data_values[randint(random_lower_limit, random_upper_limit)],
        data_values[randint(random_lower_limit, random_upper_limit)],
        data_values[randint(random_lower_limit, random_upper_limit)]
    ) for _ in range(big_tuple_list_size)]
    return big_tuple_list

def get_small_data(big_tuple_list, tuples_to_search_size):
    return sample(big_tuple_list, tuples_to_search_size)

def get_required_data(big_tuple_list_size, tuples_to_search_size):
    big_tuple_list = get_big_data(big_tuple_list_size)
    tuples_to_search = get_small_data(big_tuple_list, tuples_to_search_size)
    return (big_tuple_list, tuples_to_search)

if __name__ == '__main__':
    test_combinations = [(10000, 5), (20000, 5), (100000, 5)]
    functions_to_test = [
        normal_search_multiple_tuples_in_tuples_list,
        intersect_search_tuples_in_tuples_list
    ]
    for big_tuple_list_size, tuples_to_search_size in test_combinations:
        tuples_to_search, big_tuple_list = get_required_data(big_tuple_list_size, tuples_to_search_size)
        print(f'For a run of searching {tuples_to_search_size} needles in {big_tuple_list_size} size haystack')
        for func in functions_to_test:
            print(f'''Time taken by {func.__name__}: {timeit.timeit('func(tuples_to_search, big_tuple_list)', number=1000, globals=globals())}''')
        print()
Prints:
For a run of searching 5 needles in 10000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 0.681931
Time taken by intersect_search_tuples_in_tuples_list: 0.5806513
For a run of searching 5 needles in 20000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 0.8495027999999998
Time taken by intersect_search_tuples_in_tuples_list: 0.8799418999999999
For a run of searching 5 needles in 100000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 8.51018
Time taken by intersect_search_tuples_in_tuples_list: 8.115366799999999
There isn't a big difference in timings. My own benchmark, where one of the sought items is a middle element of the list, shows that not converting the list considerably out-performed converting it to a set. But the big difference between this benchmark and mine is that my benchmark does not convert the tuples_to_search list to a set inside the timed code; that set is pre-created, since the OP knows in advance what they are looking for. If we remove the cost of creating that set from the benchmark, we have:
import timeit
import time
import random
import string
from random import randint
from random import sample

data_values = string.ascii_lowercase + string.digits
len_data_values = len(data_values)
random_lower_limit = 0
random_upper_limit = len_data_values - 1

def normal_search_multiple_tuples_in_tuples_list(tuples_to_search, big_tuple_list):
    for t in big_tuple_list:
        if t in tuples_to_search:
            return t

def intersect_search_tuples_in_tuples_list(tuples_to_search, big_tuple_list):
    big_tuple_set = set(big_tuple_list)
    return tuples_to_search.intersection(big_tuple_set)

def get_big_data(big_tuple_list_size):
    big_tuple_list = [(
        data_values[randint(random_lower_limit, random_upper_limit)],
        data_values[randint(random_lower_limit, random_upper_limit)],
        data_values[randint(random_lower_limit, random_upper_limit)]
    ) for _ in range(big_tuple_list_size)]
    return big_tuple_list

def get_small_data(big_tuple_list, tuples_to_search_size):
    return sample(big_tuple_list, tuples_to_search_size)

def get_required_data(big_tuple_list_size, tuples_to_search_size):
    big_tuple_list = get_big_data(big_tuple_list_size)
    tuples_to_search = get_small_data(big_tuple_list, tuples_to_search_size)
    return (big_tuple_list, tuples_to_search)

if __name__ == '__main__':
    test_combinations = [(10000, 5), (20000, 5), (100000, 5)]
    functions_to_test = [
        normal_search_multiple_tuples_in_tuples_list,
        intersect_search_tuples_in_tuples_list
    ]
    for big_tuple_list_size, tuples_to_search_size in test_combinations:
        tuples_to_search, big_tuple_list = get_required_data(big_tuple_list_size, tuples_to_search_size)
        print(f'For a run of searching {tuples_to_search_size} needles in {big_tuple_list_size} size haystack')
        s = set(tuples_to_search)
        for func in functions_to_test:
            print(f'''Time taken by {func.__name__}: {timeit.timeit('func(s, big_tuple_list)', number=1000000, globals=globals())}''')
        print()
Prints:
For a run of searching 5 needles in 10000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 0.16306289999999998
Time taken by intersect_search_tuples_in_tuples_list: 0.6508408
For a run of searching 5 needles in 20000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 0.16933260000000006
Time taken by intersect_search_tuples_in_tuples_list: 0.6473307000000001
For a run of searching 5 needles in 100000 size haystack
Time taken by normal_search_multiple_tuples_in_tuples_list: 0.1777871999999996
Time taken by intersect_search_tuples_in_tuples_list: 0.7130949000000002
And you can plainly see that you should not convert the list to a set. Finally, if that's not convincing, here is my benchmark where the conversion of the small list of sought tuples to a set is done inside the search functions, and the matching element is the last element of the list being searched. This should give the maximum advantage to the method that uses set intersection:
import timeit

def s1(s, lst):
    s = set(s)
    for i in lst:
        if i in s:
            return i

def s2(s, lst):
    s = set(s)
    s2 = set(lst)
    return s & s2

lst = [(i,i) for i in range(20000)]
s = [(19999, 19999), (21000, 21000)]  # match will be the last element of lst

# look for an element of s in lst:
print(timeit.timeit('s1(s, lst)', number=1000, globals=globals()))
print(timeit.timeit('s2(s, lst)', number=1000, globals=globals()))
Prints:
0.887862600000517
1.8508867000000464