1

Say I have a python dictionary:

d = {"a":1, "b":2}

this represents the number of occurrences a character has in a string. So the above dictionary could generate a string of "abb", "bab", or "bba".

The Max similarity between two dictionaries is a ratio >= 0 and <= 1 that describes how similar the two most similarly generated strings are.

For example,

d1 = {"a":1, "b":2}
d2 = {"c": 3}
d3 = {"a":1, "d":2}

max_sim(d1, d2) # equals to 0.0 because no indexes 
# of an arrangement of ccc matches any indexes of an arrangement of abb
max_sim(d1, d3) # equals to 0.333 because an arrangement of add matches
# one out of three characters of an arrangement of abb
# note that if we compared dda and abb, the similarity ratio would be 0.0
# but we always take into account the most similarly generated strings

How can I generate the max similarity of any two dictionaries (same length) simply by looking at the number of occurrences per character? i.e. simply analyze the dictionaries and not actually generate the strings and check the similarity ratio of each pair.

Note: I'm using max_sim on dictionaries rather than strings because I've already looped through two strings to gather their dictionary data (in addition to something else). If I use max_sim on two strings (either the original strings or convert the dictionaries back to strings), I figure I'd be just doing redundant computation. So I'd appreciate it if the answer took two dictionaries as inputs.

Derek
  • 11,980
  • 26
  • 103
  • 162
  • 1
    Would a [Cosine similarity](http://en.wikipedia.org/wiki/Cosine_similarity) fit? The cosine similarity for the d1 and d3 is 0.2 in that case. – Martijn Pieters Feb 27 '13 at 23:52
  • Related: [Compute the similarity between two lists](http://stackoverflow.com/questions/14720324/14720386#14720386) – Martijn Pieters Feb 27 '13 at 23:56

1 Answers1

1

What about this:

def max_sim(d1, d2):
    # assume that's the same for both dicts
    length = sum(d1.values())
    matches = 0
    for letter in set(d1.keys() + d2.keys()):
        matches += min(d1.get(letter, 0), d2.get(letter, 0))
    return matches / float(length)

Result:

d1 = {"a":1, "b":2}
d2 = {"c": 3} 
d3 = {"a":1, "d":2}
d4 = {"a": 1, "b": 1, "c": 1 }

max_sim(d1, d2) # 0.0
max_sim(d1, d3) # 0.333
max_sim(d1, d4) # 0.666
max_sim(d1, d1) # 1.0
Manuel Ebert
  • 8,429
  • 4
  • 40
  • 61
  • +1. I suspect this is what the questioner wants, but it should be noted that this only works if the sums of the values in all the dictionaries are the same. Otherwise you'll get a different result for `max_sim(d1,d2)` than for `max_sim(d2,d1)`. – Blckknght Feb 28 '13 at 00:04
  • Correct, but I assumed this is what the questioner meant with "same length". I think the general solution would be `length = max(sum(d1.values()), sum(d2.values()))` – Manuel Ebert Feb 28 '13 at 00:05