
I recently answered a question that, in its comments section, picked up a query from another user that I couldn't answer.

Searching for a product even if code is misspelled

Given a fuzzy search parameter which will use Regular Expressions to filter a 'large' datasource, how would you go about assigning a value for 'relevance' or 'best match'?

The filter will work correctly but I have no idea how to adapt it in such a way that you can identify what values are closest to the provided search string, and what values are farthest away.

Closest in this case would be an exact match to the string (assume the '+' character doesn't exist; anything that still matches is closest). The farthest, i.e. worst, match would be exactly the opposite: the one with the largest number of non-matching characters.

For the sake of avoiding arguments, let's assume the fuzzy search uses a mix of '+' and '*' in the search pattern: X+HG*UPO+Z* or something along those lines.

The goal is to avoid using a string length comparison. In the question I answered, the data was almost guaranteed to always be the same length anyway.

Nevyn

1 Answer


You could compute the Levenshtein distance, or something similar. Approximate string matching on Wikipedia might be of some help.
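As a rough illustration of how this could be used for ranking, here is a minimal sketch in Python. It assumes the values have already passed the regex filter, and that the "literal" form of the search string is obtained by simply dropping the '+' and '*' characters (one reading of the question's "assume the '+' character doesn't exist"). The function names `levenshtein` and `rank_matches` and the sample values are made up for the example.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    # Keep the shorter string as the inner dimension to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def rank_matches(pattern: str, candidates: list[str]) -> list[tuple[int, str]]:
    """Sort already-filtered candidates from best match to worst.

    Assumption: the wildcard characters '+' and '*' are stripped from the
    pattern, and the remaining literal is compared against each candidate.
    """
    literal = re.sub(r"[+*]", "", pattern)
    return sorted((levenshtein(literal, c), c) for c in candidates)


if __name__ == "__main__":
    # Hypothetical values that would all pass the fuzzy filter.
    for distance, value in rank_matches("X+HG*UPO+Z*", ["XHGQQUPOZZZ", "XAHGUPOBZ", "XHGUPOZ"]):
        print(distance, value)
```

A distance of 0 is an exact match to the literal, and larger distances are worse matches. The score counts edits rather than comparing lengths directly, so it still separates values of the same length.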

Qtax
  • Excellent, exactly what I was looking for. The LDistance calculation would give the degree of match between the search string and the located string. Thanks. – Nevyn Jun 11 '12 at 14:57