
I'm trying to find a string metric to find the most similar entry in my list to an arbitrary input. It looks like most common string metrics place heavy weight on extraneous characters, even if a substring matches perfectly. For example, 'Corvette, red 2013' and 'corvette' have a match score of 0.11 using difflib.get_close_matches(), but 'octet rev' and 'corvette' have a match score of 0.23.

I know my list will likely have extraneous information (like 'red 2013'), but I am more interested in knowing that 'corvette' is an exact match while ignoring that extraneous information. 'Octet rev' would count as a false match for my purposes.

Are there any string match metrics that weigh the match in the way that I need? Even better, is there one already implemented in a python package?
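For context, the behaviour being asked for here (an exact substring match scoring perfectly, regardless of surrounding text) is roughly what the fuzzywuzzy/thefuzz package calls a "partial ratio." A minimal sketch of that idea using only difflib (the function name `partial_ratio` and the case-folding step are my own choices, not taken from any package):

```python
from difflib import SequenceMatcher

def partial_ratio(query, text):
    """Best SequenceMatcher ratio between the shorter string and any
    equal-length window of the longer one (case-folded). An exact
    substring match scores 1.0 no matter how much extra text there is."""
    query, text = query.lower(), text.lower()
    if len(query) > len(text):
        query, text = text, query
    n = len(query)
    # Slide a window of the shorter string's length across the longer
    # string and keep the best similarity found.
    return max(
        SequenceMatcher(None, query, text[i:i + n]).ratio()
        for i in range(len(text) - n + 1)
    )
```

With this, 'corvette' against 'Corvette, red 2013' scores a full 1.0, while 'corvette' against 'octet rev' does not, which is the ordering the question wants.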

ericksonla
  • You're actually trying to solve two problems: longest common substring with shortest edit distance. That's pretty non-trivial and an area of active research, as you can see from the results you get on Google. Those papers are not cheap either. – BeyelerStudios Jul 07 '16 at 03:28
  • What should work in your case is breaking up your query and your list entries into separate tokens, matching each pair, and finding the entry with the best {sum/avg/other score} over the matched query tokens. That way you ignore extraneous tokens that you didn't query for. – BeyelerStudios Jul 07 '16 at 03:43

0 Answers