
I have a list of strings as a query and a few hundred other lists of strings. I want to compare the query with every other list and compute a similarity score for each pair.

Example:

query = ["football", "basketball", "martial arts", "baseball"]

list1 = ["apple", "football", "basketball court"]

list2 = ["ball"]

list3 = ["martial-arts", "baseball", "banana", "food", "doctor"]

What I am doing now is an exact comparison of the strings, and I am not satisfied with the results:

score = 0
for i in query:
   if i in list1:
      score += 1

score_of_list1 = score*100//len(list1)

I found a library that may help, fuzzywuzzy, but I was wondering whether you could suggest any other approach.

Tasos
  • I don't understand what's wrong with your solution. What sort of comparison do you want? Perhaps give an example of a result you'd accept and how it's better. – Reut Sharabani Mar 11 '14 at 09:51
  • As I said, since it is an exact comparison, I don't get high scores. So I am looking for another solution that might increase the score. – Tasos Mar 11 '14 at 09:52
  • What you mean is you want a way to compare the strings, not the lists. Am I right? – Reut Sharabani Mar 11 '14 at 10:01
  • Yes, but the final score will be between the lists, just like the example in my question. I compare the strings and then calculate a score between the lists based on the string comparison results. – Tasos Mar 11 '14 at 10:05

1 Answer


If you're looking for a way to find similarity between strings, this SO question suggests Levenshtein distance as a method of doing so.

A ready-made implementation exists, and the Natural Language Toolkit (NLTK) library also provides one.
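If you would rather not add a dependency, Levenshtein distance is short enough to write by hand. The following is only a minimal sketch of the standard dynamic-programming approach; the names `levenshtein` and `similarity` are my own, and dividing by the longer string's length is just one assumed way of turning the distance into a score between 0 and 1.

```python
def levenshtein(a, b):
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for completely different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("martial arts", "martial-arts"))
print(similarity("football", "ball"))
```

With a score like this, "martial arts" and "martial-arts" differ by a single substitution, so they count as close rather than as a miss.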

A naive integration could look like this (I use random only to produce a result; the values are obviously meaningless):

#!/usr/bin/env python
from random import random

query = ["football", "basketball", "martial arts", "baseball"]
lists = [["apple", "football", "basketball court"], ["ball"], ["martial-arts", "baseball", "banana", "food", "doctor"]]

def fake_levenshtein(word1, word2):
    # Placeholder: returns a random number instead of a real string distance.
    return random()

def avg_list(l):
    # Arithmetic mean of a non-empty list of numbers.
    return sum(l) / len(l)

for l in lists:
    score = []
    # Compare every word of the candidate list with every word of the query.
    for w1 in l:
        for w2 in query:
            score.append(fake_levenshtein(w1, w2))
    print(avg_list(score))
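To get meaningful numbers you need a real string comparison in place of the random stub. As one possible sketch, the version below swaps in `difflib.SequenceMatcher` from the standard library (note that its `ratio()` is a matching-blocks measure, not Levenshtein distance). It also makes an assumption that is mine, not part of the question: each query string is scored by its best match in the candidate list, and those best matches are averaged. The names `string_similarity` and `list_score` are hypothetical.

```python
from difflib import SequenceMatcher

query = ["football", "basketball", "martial arts", "baseball"]
lists = [
    ["apple", "football", "basketball court"],
    ["ball"],
    ["martial-arts", "baseball", "banana", "food", "doctor"],
]


def string_similarity(a, b):
    # ratio() returns a float in [0, 1]; 1.0 means the strings are identical.
    return SequenceMatcher(None, a, b).ratio()


def list_score(query, candidates):
    """Average, over the query strings, of the best match found in candidates."""
    if not query or not candidates:
        return 0.0
    best_matches = [
        max(string_similarity(q, c) for c in candidates)
        for q in query
    ]
    return sum(best_matches) / len(best_matches)


for candidates in lists:
    print(candidates, round(list_score(query, candidates), 3))
```

Taking the best match per query string, rather than averaging over every pair as in the naive loop, keeps unrelated words such as "banana" or "doctor" from dragging down a list that otherwise matches well. Whether that is what you want depends on how much extra items in a list should be penalized.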

Good luck.

Reut Sharabani
  • Thank you. I will wait for a while and if I do not have any other answers, I will choose yours! – Tasos Mar 11 '14 at 10:32