Based on your problem setup, there appears to be no alternative to looping through the input list of dictionaries. However, there is a multiprocessing trick that can be applied here.
Here is your input:
dict_a = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}
dict_b = {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"}
dict_c = {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"}
dict_d = {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"}
other_dicts = [dict_b, dict_c, dict_d]
I have included @gary_fixler's map technique as similarity1, in addition to the similarity2 function that I will use for the loop technique.
import itertools
import multiprocessing
import timeit

def similarity1(a):
    # Returns a closure over `a`, so it can be mapped across the other dicts.
    def _(b):
        shared_value = set(a.items()) & set(b.items())
        dict_length = len(a)
        score_of_similarity = len(shared_value)
        return score_of_similarity / dict_length
    return _

def similarity2(c):
    # Takes an (a, b) pair so it can be used with pool.map over zipped pairs.
    a, b = c
    shared_value = set(a.items()) & set(b.items())
    dict_length = len(a)
    score_of_similarity = len(shared_value)
    return score_of_similarity / dict_length
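As a quick sanity check of the scoring itself (my addition, not part of the timing comparison), here is the calculation spelled out for the first pair: dict_a and dict_b agree on keys '1', '3', and '5', so 3 of the 6 items are shared.
shared = set(dict_a.items()) & set(dict_b.items())
print(shared)                     # {('1', 'U'), ('3', 'D'), ('5', 'U')} (set order may vary)
print(len(shared) / len(dict_a))  # 0.5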
We are evaluating 3 techniques here:
(1) @gary_fixler's map
(2) simple loop through the list of dicts
(3) multiprocessing the list of dicts
Here are the execution statements:
print(list(map(similarity1(dict_a), other_dicts)))
print([similarity2((dict_a, dict_v)) for dict_v in other_dicts])
max_processes = int(multiprocessing.cpu_count()/2-1)
pool = multiprocessing.Pool(processes=max_processes)
print([x for x in pool.map(similarity2, zip(itertools.repeat(dict_a), other_dicts))])
You will find that all 3 techniques produce the same result:
[0.5, 0.3333333333333333, 0.16666666666666666]
[0.5, 0.3333333333333333, 0.16666666666666666]
[0.5, 0.3333333333333333, 0.16666666666666666]
Note that multiprocessing.cpu_count() reports logical CPUs, so on a hyper-threaded machine you have multiprocessing.cpu_count()/2 physical cores. Assuming you have nothing else running on your system, and your program has no I/O or synchronization needs (as is the case for our problem), you will often get optimum performance with multiprocessing.cpu_count()/2 - 1 worker processes, the -1 leaving a core for the parent process.
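Two practical caveats worth sketching (the max(1, ...) clamp and the __name__ guard below are my additions, not part of the snippet above): cpu_count()/2 - 1 comes out as 0 on a dual-core machine, and on platforms that spawn worker processes (Windows, recent macOS) the pool must only be created from the guarded entry point.
if __name__ == "__main__":
    # Clamp the worker count so small machines still get at least one process.
    max_processes = max(1, multiprocessing.cpu_count() // 2 - 1)
    # The context manager tears the pool down when the block exits.
    with multiprocessing.Pool(processes=max_processes) as pool:
        scores = pool.map(similarity2, zip(itertools.repeat(dict_a), other_dicts))
    print(scores)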
Now, to time the 3 techniques:
print(timeit.timeit("list(map(similarity1(dict_a), other_dicts))",
setup="from __main__ import similarity1, dict_a, other_dicts",
number=10000))
print(timeit.timeit("[similarity2((dict_a, dict_v)) for dict_v in other_dicts]",
setup="from __main__ import similarity2, dict_a, other_dicts",
number=10000))
print(timeit.timeit("[x for x in pool.map(similarity2, zip(itertools.repeat(dict_a), other_dicts))]",
setup="from __main__ import similarity2, dict_a, other_dicts, pool",
number=10000))
This produces the following results on my laptop:
0.07092539698351175
0.06757041101809591
1.6528456939850003
You can see that the basic loop technique performs best. Multiprocessing was significantly worse than the other two techniques because of the overhead of creating processes and passing data back and forth. That does not mean multiprocessing is not useful here; quite the contrary. Look at the results for a larger number of input dictionaries:
for _ in range(7):
    other_dicts.extend(other_dicts)
Each pass doubles the list, so it grows from 3 to 384 dictionaries (3 × 2⁷). Here are the timing results for this input:
7.934810006991029
8.184540337068029
7.466550623998046
For any larger set of input dictionaries, the multiprocessing technique becomes the best option.
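If you want a single entry point that covers both regimes, one option is to dispatch on input size; this is only a sketch, and the helper name and the cutoff of 100 are illustrative rather than measured.
def similarity_scores(base, candidates, pool=None, cutoff=100):
    # Hypothetical helper: below the cutoff the plain loop wins because there is
    # no process overhead; above it the pool tends to pay for itself.
    pairs = list(zip(itertools.repeat(base), candidates))
    if pool is not None and len(candidates) >= cutoff:
        return pool.map(similarity2, pairs)
    return [similarity2(pair) for pair in pairs]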