
I have a reference dictionary, dictA, and I need to compare it (calculate the similarity between keys and values) against n dictionaries that are generated on the spot. Each dictionary has the same length. Let's say, for the sake of the discussion, that there are 3 dictionaries to compare against: dictB, dictC, dictD.

Here is what dictA looks like:

dictA = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}

Here is what dictB, dictC and dictD look like:

dictB = {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"}
dictC = {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"}
dictD = {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"}

I have a solution, but only for comparing two dictionaries at a time:

sharedValue = set(dictA.items()) & set(dictD.items())
dictLength = len(dictA)
scoreOfSimilarity = len(sharedValue)
similarity = scoreOfSimilarity / dictLength
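
For example, dictA and dictD above agree on only one pair ('4': "U"), so their similarity is 1/6 ≈ 0.17.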

My question is: how can I iterate through n dictionaries, with dictA being the primary dictionary against which I compare the others? The goal is to get a similarity value for each dictionary that is compared against the primary one.

Thanks for your help.

UltraInstinct
lechiffre
  • 1) Are those `n` dictionaries present in a list somewhere? 2) How do you calculate the similarity score for multiple iterations (for instance, an average)? – UltraInstinct Oct 11 '16 at 22:19
  • Why not just loop through the list of dictionaries from B to D? Are you looking to meet specific performance or data structure restrictions while solving this problem? – Rahul Murmuria Oct 11 '16 at 22:20
  • 1
    Just so you know, Python3 `dict.items()` already works with the `&` and the other set operators. It is not a list but a set-like object that is a view of the dictionary items. – juanpa.arrivillaga Oct 11 '16 at 22:27
  • @SuperSaiyan - 1) yes, the list will always be on the input. The number of dictionaries can vary: it can be 3 as in the example, or it can be 100 dictionaries to compare against. 2) not sure if I follow :/ – lechiffre Oct 12 '16 at 07:06
  • @RahulMurmuria - I am looking for the fastest performer, since I expect thousands of dicts in the future. Maybe dict is not the best structure for performance. What would you recommend? – lechiffre Oct 12 '16 at 07:07
  • @lechiffre, I have posted an answer. Note the variable naming convention there. What you have used comes from Java, and for Python, the naming conventions are a bit different. – Rahul Murmuria Oct 12 '16 at 16:55

4 Answers


Here's a general structure -- assuming that you can generate the dictionaries individually, using each before generating the next. This sounds like what you might want. calculate_similarity would be a function containing your "I have a solution" code above.

reference = {'1':"U", '2':"D", '3':"D", '4':"U", '5':"U",'6':"U"}
while True:
    on_the_spot = generate_dictionary()
    if on_the_spot is None:
        break
    calculate_similarity(reference, on_the_spot)
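
For concreteness: calculate_similarity is just the snippet from the question wrapped in a function, and generate_dictionary stands for whatever produces your dictionaries, returning None when it runs out. A possible stand-in (the random contents and the three-dictionary limit are purely illustrative):

import random

def calculate_similarity(a, b):
    # the two-dict solution from the question, as a function
    shared_value = set(a.items()) & set(b.items())
    return len(shared_value) / len(a)

_remaining = 3  # illustrative only: stop after three dictionaries

def generate_dictionary():
    # stand-in that returns a random U/D dictionary, or None when exhausted
    global _remaining
    if _remaining == 0:
        return None
    _remaining -= 1
    return {str(k): random.choice("UD") for k in range(1, 7)}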

If you need to iterate through dictionaries that have already been generated, they have to be in an iterable Python structure. As you generate them, build a list of dictionaries:

victim_list = [
    {'1':"U", '2':"U", '3':"D", '4':"D", '5':"U",'6':"D"},
    {'1':"U", '2':"U", '3':"U", '4':"D", '5':"U",'6':"D"},
    {'1':"D", '2':"U", '3':"U", '4':"U", '5':"D",'6':"D"}
]
for on_the_spot in victim_list:
    calculate_similarity(reference, on_the_spot)  # proceed as above

Are you familiar with Python generators? A generator is like a function that returns its values with yield instead of return. If so, use one instead of the list above.
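
A minimal sketch of that generator version, reusing the hypothetical generate_dictionary above:

def dictionary_stream():
    # yield dictionaries one at a time instead of building a list
    while True:
        on_the_spot = generate_dictionary()
        if on_the_spot is None:
            return
        yield on_the_spot

for on_the_spot in dictionary_stream():
    calculate_similarity(reference, on_the_spot)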

lejlot
Prune

Based on your problem setup, there appears to be no alternative to looping through the input list of dictionaries. However, there is a multiprocessing trick that can be applied here.

Here is your input:

dict_a = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}
dict_b = {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"}
dict_c = {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"}
dict_d = {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"}
other_dicts = [dict_b, dict_c, dict_d]

I have included @gary_fixler's map technique as similarity1, in addition to the similarity2 function that I will use for the loop technique.

def similarity1(a):
    def _(b):
        shared_value = set(a.items()) & set(b.items())
        dict_length = len(a)
        score_of_similarity = len(shared_value)
        return score_of_similarity / dict_length
    return _

def similarity2(pair):
    a, b = pair  # pair is an (a, b) tuple of dictionaries
    shared_value = set(a.items()) & set(b.items())
    dict_length = len(a)
    score_of_similarity = len(shared_value)
    return score_of_similarity / dict_length

We are evaluating 3 techniques here:
(1) @gary_fixler's map
(2) simple loop through the list of dicts
(3) multiprocessing the list of dicts

Here are the execution statements (multiprocessing and itertools must be imported first):

print(list(map(similarity1(dict_a), other_dicts)))
print([similarity2((dict_a, dict_v)) for dict_v in other_dicts])

max_processes = max(1, multiprocessing.cpu_count() // 2 - 1)  # never drop below one worker
pool = multiprocessing.Pool(processes=max_processes)
print([x for x in pool.map(similarity2, zip(itertools.repeat(dict_a), other_dicts))])

You will find that all 3 techniques produce the same result:

[0.5, 0.3333333333333333, 0.16666666666666666]
[0.5, 0.3333333333333333, 0.16666666666666666]
[0.5, 0.3333333333333333, 0.16666666666666666]

Note that multiprocessing.cpu_count() counts logical cores, so on a machine with hyper-threading you have multiprocessing.cpu_count()/2 physical cores. Assuming that you have nothing else running on your system, and that your program has no I/O or synchronization needs (as is the case for our problem), you will often get optimum performance with multiprocessing.cpu_count()/2 - 1 processes, the -1 being for the parent process.
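
One portability caveat: on platforms that spawn worker processes instead of forking (Windows, for example), the pool setup and the map call should sit behind an import guard. A sketch of the same call in that form (note that Pool.map already returns a list, so the comprehension around it is optional):

import itertools
import multiprocessing

if __name__ == "__main__":
    max_processes = max(1, multiprocessing.cpu_count() // 2 - 1)
    with multiprocessing.Pool(processes=max_processes) as pool:
        print(pool.map(similarity2, zip(itertools.repeat(dict_a), other_dicts)))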

Now, to time the 3 techniques:

print(timeit.timeit("list(map(similarity1(dict_a), other_dicts))",
                    setup="from __main__ import similarity1, dict_a, other_dicts", 
                    number=10000))

print(timeit.timeit("[similarity2((dict_a, dict_v)) for dict_v in other_dicts]",
                    setup="from __main__ import similarity2, dict_a, other_dicts", 
                    number=10000))

print(timeit.timeit("[x for x in pool.map(similarity2, zip(itertools.repeat(dict_a), other_dicts))]",
                    setup="from __main__ import similarity2, dict_a, other_dicts, pool", 
                    number=10000))

This produces the following results on my laptop:

0.07092539698351175
0.06757041101809591
1.6528456939850003

You can see that the basic loop technique performs best here. The multiprocessing run was significantly worse than the other two techniques because of the overhead of creating processes and passing data back and forth. This does not mean that multiprocessing is not useful here; quite the contrary. Look at the results for a larger number of input dictionaries:

for _ in range(7):
    other_dicts.extend(other_dicts)

This extends the dictionary list to 384 items. Here are the timing results for this input:

7.934810006991029
8.184540337068029
7.466550623998046

For any larger set of input dictionaries, the multiprocessing technique becomes the fastest of the three.
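
If each unit of work stays this small, Pool.map's optional chunksize argument can also help, shipping the pairs to the workers in batches and amortizing the messaging overhead (the value 32 below is only illustrative):

results = pool.map(similarity2,
                   zip(itertools.repeat(dict_a), other_dicts),
                   chunksize=32)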

Rahul Murmuria

If you stick your solution in a function, you can call it by name for any two dicts. Also, if you curry the function by breaking up the arguments across nested functions, you can partially apply the first dict to get back a function that just wants the second (or you could use functools.partial), which makes it easy to map:

def similarity (a):
    def _ (b):
        sharedValue = set(a.items()) & set(b.items())
        dictLength = len(a)
        scoreOfSimilarity = len(sharedValue)
        return scoreOfSimilarity/dictLength
    return _

Aside: the above can also be written as a single expression via nested lambdas:

similarity = lambda a: lambda b: len(set(a.items()) & set(b.items())) / len(a)

Now you can get the similarity between dictA and the remainder with a map:

otherDicts = [dictB, dictC, dictD]
scores = map(similarity(dictA), otherDicts)
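
For comparison, the functools.partial route mentioned above would use an ordinary two-argument function instead of the nested definition (a sketch, untested like the rest):

from functools import partial

def similarityPair(a, b):
    sharedValue = set(a.items()) & set(b.items())
    return len(sharedValue) / len(a)

scores = map(partial(similarityPair, dictA), otherDicts)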

Now you can use max() (or min(), or whatever fits your notion of "best") to pick the winner from the scores:

winner = max(scores)

Warning: I have not tested any of the above.

Gary Fixler
  • please do not use `_` as the name of a function, even if it is an inner function. http://stackoverflow.com/questions/5893163/what-is-the-purpose-of-the-single-underscore-variable-in-python – lejlot Oct 11 '16 at 22:43

Thanks to everyone for participating in the answers. Here is the result that does what I need:

def compareTwoDictionaries(self, absolute, reference, candidate):
    # look only for an absolute fit, yes or no
    if absolute:
        similarity = reference == candidate
    else:
        # items that are the same between the two dictionaries
        shared_items = set(reference.items()) & set(candidate.items())
        # length of the dictionary, for the % calculation
        dictLength = len(reference)
        # length of shared_items, for the % calculation
        scoreOfSimilarity = len(shared_items)
        # final score: similarity
        similarity = scoreOfSimilarity / dictLength
    return similarity

Here is how the function is called:

for candidate in victim_list:
    output = oandaConnectorCalls.compareTwoDictionaries(False, reference, candidate)

"Reference" dict and "victim_list" dict are used as described above.

lechiffre