13

I'd like to compute the similarity between two lists of various lengths.

eg:

listA = ['apple', 'orange', 'apple', 'apple', 'banana', 'orange'] # (length = 6)
listB = ['apple', 'orange', 'grapefruit', 'apple'] # (length = 4)

as you can see, a single item can appear multiple times in a list, and the lengths are of different sizes.

I've already thought of comparing the frequencies of each item, but that does not encompass the size of each list (a list that is simply twice another list should be similar, but not perfectly similar)

eg2:

listA = ['apple', 'apple', 'orange', 'orange']
listB = ['apple', 'orange']
similarity(listA, listB) # should NOT equal 1

So I basically want to encompass the size of the lists, and the distribution of items in the list.

Any ideas?

kmace
  • 1,994
  • 3
  • 23
  • 39
  • 5
    Those are lists, not sets. – Martijn Pieters Feb 06 '13 at 01:58
  • By `similarity` do you mean to create a third list that contains the elements that appear in both listA and listB? so that the result in your case would be `['apple', 'orange']`? – Konsol Labapen Feb 06 '13 at 04:21
  • by similarity I mean some measure of how similar they are. so a comparing 2 identical sets (or list) would give you a score of 1, and 2 completely dissimilar sets would give you zero. these sets are, however different in size, and may contain repeating elements – kmace Feb 06 '13 at 05:12

3 Answers3

32

Use collections.Counter() perhaps; those are multi-sets, or bags, in datatype parlance:

from collections import Counter

counterA = Counter(listA)
counterB = Counter(listB)

Now you can compare these by entries or frequencies:

>>> counterA
Counter({'apple': 3, 'orange': 2, 'banana': 1})
>>> counterB
Counter({'apple': 2, 'orange': 1, 'grapefruit': 1})
>>> counterA - counterB
Counter({'orange': 1, 'apple': 1, 'banana': 1})
>>> counterB - counterA
Counter({'grapefruit': 1})

You can calculate their cosine similarity using:

import math

def counter_cosine_similarity(c1, c2):
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)

Which gives:

>>> counter_cosine_similarity(counterA, counterB)
0.8728715609439696

The closer to 1 that value, the more similar the two lists are.

The cosine similarity is one score you can calculate. If you care about the length of the list, you can calculate another; if you keep that score between 0.0 and 1.0 as well you can multiply the two values for a final score between -1.0 and 1.0.

For example, to take relative lengths into account you could use:

def length_similarity(c1, c2):
    lenc1 = sum(c1.itervalues())
    lenc2 = sum(c2.itervalues())
    return min(lenc1, lenc2) / float(max(lenc1, lenc2))

and then combine into a function that takes the lists as inputs:

def similarity_score(l1, l2):
    c1, c2 = Counter(l1), Counter(l2)
    return length_similarity(c1, c2) * counter_cosine_similarity(c1, c2)  

For your two example lists, that results in:

>>> similarity_score(['apple', 'orange', 'apple', 'apple', 'banana', 'orange'], ['apple', 'orange', 'grapefruit', 'apple'])
0.5819143739626463
>>> similarity_score(['apple', 'apple', 'orange', 'orange'], ['apple', 'orange'])
0.4999999999999999

You can mix in other metrics as needed.

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
  • this kind of works, but if we look at the example where list c1 is just a double count of c2, then the similarity is still 1. so not exactly what I am looking for. thanks for the code though. – kmace Feb 06 '13 at 08:09
  • 1
    @kamula: This is a starting point; if the cos similarity is 1, see if one has a larger top count than the other (`.most_common(1)` on either) to adjust, etc. – Martijn Pieters Feb 06 '13 at 09:18
  • If you don't want the length-normalized score that the cosine distance provides, you could calculate the Euclidean distance between the two lists – duhaime Dec 16 '14 at 23:08
1

From a theoretical point of view : I recommend you look up cosine similarity http://en.wikipedia.org/wiki/Cosine_similarity

You may have to modify to fit your scheme, but the idea of cosine similarity is great.

Vigneshwaren
  • 1,273
  • 2
  • 15
  • 24
1

I believe what you are looking for is to counting the number of inversions in an array The question has your answer: Counting inversions in an array

Community
  • 1
  • 1
Computernerd
  • 7,378
  • 18
  • 66
  • 95
  • I'm sorry, but I'm not sure if i get what you mean. How can comparing two sets be translated into counting the number of inversions in an implementation of merge sort? – kmace Feb 06 '13 at 08:18