1

I have a list: list = ['item1', 'item2', 'item3', 'item4']

I want to compare the similarity of all items.

If item2 and item3 is similar, the result become list = ['item1', 'item2', 'item4']

Edit:

Sorry for my confusing question.

list items is set of trigrams. I want to remove the similar item in a list.

list = [('very','beauty','place'),('very','good','place'),('another','trigram','item')]

with compute jaccard similarity every pairs-item in that list, if jaccard score of pairs-item > 0.4, i call it similar. In this example, item1 and item2 are similar. The last output i want is:

list = [('very','beauty','place'),('another','trigram','item')]

This is the method to calculate jaccard score:

def compute_jaccard_index(set_1, set_2):
   n = len(set_1.intersection(set_2))
   return n / float(len(set_1) + len(set_2) - n)
Fahmi Rizal
  • 137
  • 2
  • 9

4 Answers4

2

If these items are strings or numbers you are looking for the set builtin.

In example:

In [1]: foo = [1, 32, 4, 5, 6, 5]

In [2]: set(foo)
Out[2]: {1, 4, 5, 6, 32}

In [3]: list(set(foo))
Out[3]: [32, 1, 4, 5, 6]

Depends what you mean by similar really.

Mahdi Yusuf
  • 19,931
  • 26
  • 72
  • 101
2

This will work if you have a similarity function instead of a straight equality comparison:

itemsToRemove = []
n = len(list)
for i in range(n):
  for j in range(i+1,n):
      if(similarTest(list[i], list[j]):
        itemsToRemove.append(list[i])
        break
return [item for item in list if item not in itemsToRemove]

Of course if you were actually looking to remove identical items, as others have suggested, then sets will work great.

Tom Swifty
  • 2,864
  • 2
  • 16
  • 25
  • This would almost certainly be better with some kind of hashing solution and a dict. Also, might be nicer to use itertools to compute the cartesian product. – Marcin Sep 16 '13 at 19:09
  • 2
    @Marcin You can’t hash similiarity between two individual elements. – poke Sep 16 '13 at 19:10
  • @poke No, actually you can. That's what a hashing function does. – Marcin Sep 16 '13 at 19:13
  • @Marcin: For example suppose I define that two strings are "similar" if their Levenshtein distance is less than or equal to 2. If similarity is not an equivalence relation then hashing doesn't preserve it (although there are some approximations using hashes). In that case you could do better than this, for example using a k-d tree. But if you don't know anything about the structure of the similarity, if it's just a black box function you call, then this is about as good as it gets. – Steve Jessop Sep 16 '13 at 19:16
  • @SteveJessop That's all true. However, equivalence relations are the more usual case; and in that case you can indeed "hash similarity between two individual elements". – Marcin Sep 16 '13 at 19:23
2

This solution will continue to look at pairs of two elements until it has looked at all pairs without filtering any. It’s not an effective solution as it will continue to look at the same pairs over and over again, and it also does not make use of a possible transitivity. But it’s a start.

>>> from itertools import combinations
>>> def filterSimilar (d):
        while True:
            filteredOne = False
            for s, t in combinations(d, 2):
                if isSimilar(s, t):
                    d.remove(t)
                    filteredOne = True
                    break
            if not filteredOne:
                break
>>> d = ['asdf', 'asxf', 'foo', 'bar', 'baz']   
>>> filterSimilar(d)
>>> d
['asdf', 'foo', 'bar']

A possible example implementation for isSimilar is the following which uses the Levenshtein distance between two strings:

def levenshteinDistance (s, t):
    if len(s) == 0:
        return len(t)
    if len(t) == 0:
        return len(s)
    return min(levenshteinDistance(s[:-1], t) + 1, levenshteinDistance(s, t[:-1]) + 1, levenshteinDistance(s[:-1], t[:-1]) + (0 if s[-1] == t[-1] else 1))

def isSimilar (s, t):
    return levenshteinDistance(s, t) < 2

(Note that the Levenshtein distance I used in this example is not an example for a transitive comparison)


Using your compute_jaccard_index function, the isSimilar function now looks like this:

def isSimilar (s, t):
    return compute_jaccard_index(s, t) > .4

And then used on your example data:

>>> lst = [{'very','beauty','place'},{'very','good','place'},{'another','trigram','item'}]
>>> filterSimilar(lst)
>>> lst
[{'very', 'beauty', 'place'}, {'item', 'trigram', 'another'}]
poke
  • 369,085
  • 72
  • 557
  • 602
  • in `def filterSimilar(d)` miss the `return d` – Fahmi Rizal Sep 16 '13 at 19:58
  • any other way to make it faster? – Fahmi Rizal Sep 16 '13 at 20:10
  • It’s not missing the return, because it changes the passed list in place. A simple way to make it faster would be to remember which comparisons you already processed; or instead of restarting whenever something is removed, you could remember that instead, skip it whenever a future comparison touches it and finally delete it at the end. – poke Sep 16 '13 at 20:17
0

You can use set.It will remove all the duplicate element from the list.

>>>list = [1,2,3,4,4,5,2,3,1]
>>>list =set(list)
>>>list
set([1, 2, 3, 4, 5])
Kousik
  • 21,485
  • 7
  • 36
  • 59