
I have two lists: one holds the interests of a user, and the other holds the keywords describing a book. I want to recommend the book to the user based on their interests list. I am using the SequenceMatcher class of the Python library difflib to match similar words like "game", "games", "gaming", "gamer", etc. Its ratio function gives me a number in [0, 1] stating how similar the two strings are. But I got stuck on one example: the similarity between "looping" and "shooting", two unrelated words, comes out to 0.6667.
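For illustration, the score can be traced to the matching blocks: "looping" and "shooting" share "oo" and "ing", 5 characters out of 15 in total, and ratio() is defined as 2*M/T:

from difflib import SequenceMatcher

# "looping" vs "shooting" share the blocks "oo" and "ing" (5 of 15
# characters), so ratio() = 2 * 5 / 15 = 0.6667 despite the words
# being unrelated.
s = SequenceMatcher(None, "looping", "shooting")
print(s.ratio())                # 0.6666666666666666
print(s.get_matching_blocks())  # the last block is always a zero-size dummy

The matching loop in my code: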

from difflib import SequenceMatcher

final_score = 0
for interest in self.interests:
    for keyword in keywords:
        s = SequenceMatcher(None, interest, keyword)
        match_freq = s.ratio()  # similarity score in [0, 1]
        if match_freq >= self.limit:
            # print(interest, keyword, match_freq)
            final_score += 1
            break  # one matching keyword is enough for this interest

Is there any other way to perform this kind of matching in Python?

Anurag Sharma
  • Have a look at this lib: https://github.com/seatgeek/fuzzywuzzy – Freelancer Sep 18 '13 at 12:08
  • Hello. Do you mean that 0.6667 is less than self.limit, and therefore "looping" and "shooting" are declared non-matching while you would like a result giving a match between these two words? Or the contrary? – eyquem Sep 18 '13 at 12:09
  • @eyquem there is no similarity between "looping" and "shooting", but the ratio function is giving a high match ... that's the concern – Anurag Sharma Sep 18 '13 at 12:12
  • What do you think of raising the value of self.limit? – eyquem Sep 18 '13 at 12:15
  • @eyquem ratio() called on "gaming" and "game" gives 0.6, while "looping" and "shooting" gives 0.66667, so increasing self.limit will not help – Anurag Sharma Sep 18 '13 at 12:23
  • OK, my question was misguided; I should have tested with the words you gave. Your problem makes me realize that difflib is probably better suited to comparing long strings than short ones. I think the only solution is to add additional tests with the help of regular expressions. – eyquem Sep 18 '13 at 12:32
  • I think that resorting to ``NLTK``, as suggested by Xin and 2ero, is probably the soundest way to obtain the comparisons you want. My solution may be useful if you want to improve your code quickly without having to study NLTK, but it isn't based on semantics, so it may fail in one case or another. – eyquem Sep 18 '13 at 13:21

3 Answers


Firstly, a word can have many senses, so when you try to find similar words you might need some word-sense disambiguation: http://en.wikipedia.org/wiki/Word-sense_disambiguation.
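For example (a small illustration; it assumes the WordNet corpus has been downloaded, e.g. via nltk.download('wordnet')), even a common word like "game" maps to several synsets, one per sense:

from nltk.corpus import wordnet as wn

# Each synset of "game" is a distinct sense (several nouns, a verb, an adjective).
for synset in wn.synsets("game"):
    print(synset.name(), "-", synset.definition())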

Given a pair of words, if we take the most similar pair of senses as the gauge of whether two words are similar, we can try this:

from nltk.corpus import wordnet as wn
from itertools import product

wordx, wordy = "cat", "dog"
sem1, sem2 = wn.synsets(wordx), wn.synsets(wordy)

# Keep the best Wu-Palmer score over every pair of senses.
maxscore = 0
for i, j in list(product(*[sem1, sem2])):
    score = i.wup_similarity(j)  # Wu-Palmer Similarity; None if undefined
    if score is not None and score > maxscore:
        maxscore = score

There are other similarity functions that you can use: http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html. The only problem is when you encounter words that are not in WordNet; then I suggest you fall back on difflib.
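A rough sketch of that fallback idea (the helper name word_similarity is mine, not from the original answer): use WordNet when both words have synsets, and difflib otherwise:

from difflib import SequenceMatcher
from itertools import product
from nltk.corpus import wordnet as wn

def word_similarity(wordx, wordy):
    # Best Wu-Palmer score over all sense pairs; difflib's ratio() as a
    # fallback when either word is missing from WordNet.
    sem1, sem2 = wn.synsets(wordx), wn.synsets(wordy)
    if not sem1 or not sem2:
        return SequenceMatcher(None, wordx, wordy).ratio()
    maxscore = 0.0
    for i, j in product(sem1, sem2):
        score = i.wup_similarity(j)  # None when undefined for the pair
        if score is not None and score > maxscore:
            maxscore = score
    return maxscore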

alvas
    Is there a reason for the `*[...]` in your product call? It seems like you could leave it out to get the same effect (the product of iterables `sem1` and `sem2`) with just `product(sem1, sem2)`. Am I missing something? – Blckknght Oct 10 '13 at 10:32
  • Pardon the `*`, I was a little verbose in my code. Doing `product(*[sem1,sem2])` is effectively the same as `product(sem1,sem2)`, but in the first the arguments come from a single list `[sem1,sem2]`, while in the second they are passed as 2 parameters `sem1, sem2`. – alvas Oct 10 '13 at 13:09
  • `issame = list(product(x,y)) == list(product(*[x,y]))` and you get `issame == True` =) – alvas Oct 10 '13 at 13:12
  • @alvas Just spitballing here... Don't you want to also do `j.wup_similarity(i)`? I found "hot", "warm" get different scores from "warm", "hot". – John Strood Sep 14 '18 at 13:50
  • See https://stackoverflow.com/questions/20075335/is-wordnet-path-similarity-commutative , note `wup_similarity` is based on path =) – alvas Sep 14 '18 at 14:45
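Following up on the ordering question in the two comments above, one hedged workaround (not from the answer itself) is to score both argument orders and treat None as 0:

from nltk.corpus import wordnet as wn

s1, s2 = wn.synset('dog.n.01'), wn.synset('cat.n.01')

# wup_similarity can depend on argument order and may return None for
# some pairs, so evaluate both directions and keep the larger value.
sim = max(s1.wup_similarity(s2) or 0.0, s2.wup_similarity(s1) or 0.0)
print(sim)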

At first, I thought of regular expressions to perform additional tests that would discriminate the matches with a low ratio. That can be a solution for a specific problem, like the one happening with words ending in ing, but it covers only a limited case, and there can be numerous other cases, each needing its own specific treatment.

Then I thought that we could try to find an additional criterion that eliminates pairs of words that are not semantically related, yet whose letter-similarity ratio is high enough to be detected as a match, while at the same time catching genuinely related terms whose ratio is low only because they are short.

Here's a possibility:

from difflib import SequenceMatcher

interests = ('shooting', 'gaming', 'looping')
keywords = ('loop', 'looping', 'game')

s = SequenceMatcher(None)

limit = 0.50

for interest in interests:
    s.set_seq2(interest)
    for keyword in keywords:
        s.set_seq1(keyword)
        # A real match must clear the ratio threshold AND consist of a
        # single contiguous matching block (one block plus the zero-size
        # dummy that get_matching_blocks() always appends).
        b = s.ratio() >= limit and len(s.get_matching_blocks()) == 2
        print('%10s %-10s  %f  %s' % (interest, keyword,
                                      s.ratio(),
                                      '** MATCH **' if b else ''))
    print()

gives

  shooting loop        0.333333  
  shooting looping     0.666667  
  shooting game        0.166667  

    gaming loop        0.000000  
    gaming looping     0.461538  
    gaming game        0.600000  ** MATCH **

   looping loop        0.727273  ** MATCH **
   looping looping     1.000000  ** MATCH **
   looping game        0.181818  

Note this from the doc:

SequenceMatcher computes and caches detailed information about the second sequence, so if you want to compare one sequence against many sequences, use set_seq2() to set the commonly used sequence once and call set_seq1() repeatedly, once for each of the other sequences.

eyquem
    it's a nice idea to think about for `stemming`. Thanks for sharing. – alvas Sep 19 '13 at 17:25
  • @2ero Sorry, English isn't my mother language and I sometimes have difficulty understanding. At the moment I don't understand what you mean by ``stemming``. – eyquem Sep 19 '13 at 18:17
  • no worries, it's a technique in `information retrieval`, http://en.wikipedia.org/wiki/Stemming – alvas Sep 20 '13 at 04:29

That's because SequenceMatcher is based on edit distance or something similar. Semantic similarity is more suitable for your case, or a hybrid of the two.

Take a look at the NLTK package (code example) as you are using Python, and maybe this paper.

People using C++ can check this open-source project for reference.
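As a sketch of such a hybrid (the name hybrid_score and the weight parameter are illustrative assumptions, not something from this answer), one could blend a semantic Wu-Palmer score with difflib's surface ratio:

from difflib import SequenceMatcher
from itertools import product
from nltk.corpus import wordnet as wn

def hybrid_score(a, b, weight=0.7):
    # Weighted blend of semantic similarity (best Wu-Palmer score over
    # all sense pairs) and surface similarity (difflib's ratio()).
    surface = SequenceMatcher(None, a, b).ratio()
    semantic = max((s1.wup_similarity(s2) or 0.0
                    for s1, s2 in product(wn.synsets(a), wn.synsets(b))),
                   default=0.0)
    return weight * semantic + (1 - weight) * surface

With this weighting, "game"/"gaming" should keep a decent score through the semantic term while "looping"/"shooting" is penalized, though the exact numbers depend on the installed WordNet data.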

Xin
  • +1 for recommending NLTK! It's perfect for this type of usage. – aquavitae Sep 18 '13 at 14:18
  • [this question](http://stackoverflow.com/questions/16877517/compare-similarity-of-terms-expressions-using-nltk/16922499#16922499) or [this other question](http://stackoverflow.com/questions/6704499/algorithm-to-compare-similarity-of-english-sentences) might help with the semantic similarity bits – arturomp Sep 18 '13 at 18:21