
My aim is to identify the top 10 most similar strings in a large list of strings, given an input string. This is for a web-based API, so I need a very fast response time (<100ms is ideal).

I am working in Python but can be flexible if there is a better way to achieve this (be it through Bash scripts or another language). I have tried a variety of approaches so far, including difflib and fuzzywuzzy, but the results have either been too slow to compute or not ideal.

In my most recent experiment, the fuzzywuzzy-based solution returns a less-than-ideal result set.

An input of "coupons" returns:

                     Query         Stem  Key compare  score
34046              copson       copson    1  coupon     83
35011           couponcom    couponcom    1  coupon     80
61206             groupon      groupon    1  coupon     77
5834          able coupon   abl coupon    1  coupon     75
124231        rjr coupons   rjr coupon    1  coupon     75
35026          couponscom   couponscom    1  coupon     75
34991             couples        coupl    1  coupon     73
34993   couples dominated  coupl domin    1  coupon     71
11236        arbys coupon  arbi coupon    1  coupon     71

While there are related keywords in the result set, it also contains unrelated keywords ("couples"), and I know that better matches in the large corpus of strings, e.g. "birthday coupon" and "supermarket coupons", are being overlooked because they have a larger edit distance from the original query.

Is there an approach that does something like n-gram matching to produce more relevant similarity? Speed is the top priority here; I deal with issues such as misspellings and stemming ahead of time.
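
For reference, the fuzzywuzzy lookup described above was along these lines (a minimal sketch: the corpus contents are placeholders and `fuzz.ratio` is only an assumed scorer, not necessarily the exact one used):

```python
# Sketch of a fuzzywuzzy top-10 lookup over a large list of strings.
# `corpus` is a placeholder; the real list is far larger.
from fuzzywuzzy import fuzz, process

corpus = ["copson", "couponcom", "groupon", "able coupon", "rjr coupons",
          "birthday coupon", "supermarket coupons", "couples", "arbys coupon"]

query = "coupons"

# Returns the 10 highest-scoring (string, score) pairs by Levenshtein-based ratio.
top_10 = process.extract(query, corpus, scorer=fuzz.ratio, limit=10)
print(top_10)
```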

  • If you're looking for n-gram matching, have you tested if the performance of the [ngram module](https://pythonhosted.org/ngram/ngram.html#ngram.NGram.search) is acceptable for you? If using n-grams is not a requirement, how do you define "relevant similarity"? – Nickolay Aug 23 '19 at 21:45
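
To make that suggestion concrete, a minimal sketch of a character n-gram lookup with the linked ngram package could look like the following (the corpus, the trigram size N=3, and the call pattern are illustrative, based on the package's documented NGram.search API):

```python
# Sketch of character n-gram lookup with the `ngram` package (pip install ngram).
from ngram import NGram

corpus = ["copson", "couponcom", "groupon", "able coupon", "rjr coupons",
          "birthday coupon", "supermarket coupons", "couples", "arbys coupon"]

# Build an n-gram index over the corpus (N=3 means character trigrams).
index = NGram(corpus, N=3)

# search() returns (string, similarity) pairs sorted by n-gram overlap with the query.
matches = index.search("coupons")[:10]
print(matches)
```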

1 Answer


String similarity is a broad topic in NLP, so it is important to decide on the metric first. fuzzywuzzy uses Levenshtein distance, which is appropriate for dealing with misspellings but does not capture semantic similarity in any way. From what you write, semantic similarity is what you are looking for, so I suggest using a Word2Vec model via the gensim library.

For this, I suggest you refer to the answers to this question: How to use word2vec to calculate the similarity distance by giving 2 words?
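
As a rough illustration of that approach (not the code from the linked answer), a ranking sketch with a pretrained model loaded through gensim might look like the one below; the choice of pretrained model and the averaging of word vectors for multi-word strings are illustrative assumptions:

```python
# Sketch: rank candidate strings against a query by embedding similarity.
# "word2vec-google-news-300" is one pretrained option in gensim-data (large download);
# averaging token vectors for multi-word strings is a simplification.
import numpy as np
import gensim.downloader as api

model = api.load("word2vec-google-news-300")  # pretrained KeyedVectors

def embed(text):
    """Average the vectors of in-vocabulary tokens; None if none are known."""
    vecs = [model[tok] for tok in text.lower().split() if tok in model]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

corpus = ["birthday coupon", "supermarket coupons", "couples", "groupon"]
query_vec = embed("coupon")

ranked = []
for candidate in corpus:
    vec = embed(candidate)
    if vec is not None:
        ranked.append((candidate, cosine(query_vec, vec)))

ranked.sort(key=lambda pair: pair[1], reverse=True)
print(ranked[:10])
```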

sophros