My aim is to identify the top 10 most similar strings in a large list of strings, given an input string. This is for a web-based API so I would need a very fast response time (<100ms is ideal).
I am working in Python but can be flexible if there is a better way to achieve this (be it through Bash scripts or another language). I have tried a variety of approaches so far, including difflib and fuzzywuzzy, but they have either been too slow or have returned poor results.
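For reference, the difflib attempt looked roughly like this (a minimal sketch; the candidate list here is a toy stand-in for the real corpus):

```python
import difflib

candidates = ["copson", "couponcom", "groupon", "birthday coupon",
              "supermarket coupons", "couples"]

# get_close_matches ranks candidates by SequenceMatcher ratio,
# an edit-distance-style score; cutoff=0.0 keeps every candidate
# so we always get a full top-n list back.
matches = difflib.get_close_matches("coupons", candidates, n=10, cutoff=0.0)
```

Like fuzzywuzzy, this scores every candidate against the query on each request, which is part of why it is too slow on a large list.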
In my most recent experiment, the fuzzywuzzy-based solution returns a suboptimal result set.
An input of "coupons" returns:
        Query              Stem         Key  compare  score
34046   copson             copson       1    coupon   83
35011   couponcom          couponcom    1    coupon   80
61206   groupon            groupon      1    coupon   77
5834    able coupon        abl coupon   1    coupon   75
124231  rjr coupons        rjr coupon   1    coupon   75
35026   couponscom         couponscom   1    coupon   75
34991   couples            coupl        1    coupon   73
34993   couples dominated  coupl domin  1    coupon   71
11236   arbys coupon       arbi coupon  1    coupon   71
While there are related keywords in the result set, it also contains non-related keywords ("couples"), and I know that better matches in the large corpus of strings (e.g. "birthday coupon", "supermarket coupons") are being overlooked because they have a larger edit distance to the original query.
Is there an approach, such as n-gram matching, that would produce more relevant similarity rankings? Speed is the top priority here; I deal with issues such as misspellings and stemming ahead of time.
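To make the n-gram idea concrete, here is a self-contained sketch of what I have in mind: score candidates by the Sørensen-Dice coefficient over character trigram multisets instead of edit distance (all names here are illustrative, and a real deployment would want an inverted index over the grams rather than a full scan):

```python
from collections import Counter

def ngrams(s: str, n: int = 3) -> Counter:
    # Pad with spaces so word boundaries contribute grams too.
    s = f" {s} "
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def dice(a: str, b: str, n: int = 3) -> float:
    # Sørensen-Dice coefficient over character n-gram multisets:
    # 2 * |overlap| / (|grams(a)| + |grams(b)|)
    ga, gb = ngrams(a, n), ngrams(b, n)
    overlap = sum((ga & gb).values())
    total = sum(ga.values()) + sum(gb.values())
    return 2 * overlap / total if total else 0.0

def top_k(query: str, candidates: list, k: int = 10) -> list:
    # Brute-force ranking; an inverted gram index would avoid
    # touching candidates that share no grams with the query.
    return sorted(candidates, key=lambda c: dice(query, c), reverse=True)[:k]
```

Unlike a pure edit-distance score, this rewards shared substrings regardless of total length, so "supermarket coupons" and "birthday coupon" outrank "couples" for the query "coupons".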