4

I'm building an application that compares a large list of brands/domains against predetermined keywords and detects variations of them.

Examples:

facebook vs facebo0k.com
linkedIn vs linkedln.com
stackoverflow vs stckoverflow

I'm wondering whether, for the simple purpose of comparing two strings and detecting subtle variations, both algorithms (Levenshtein and Jaro-Winkler) are adequate, so that there is no added value in choosing one over the other except for performance.

Andre
  • 2
    Probably these links will help: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance#Relationship_with_other_edit_distance_metrics or https://stackoverflow.com/questions/25540581/difference-between-jaro-winkler-and-levenshtein-distance – zforgo May 09 '20 at 16:38

3 Answers

2

I would use Damerau–Levenshtein with the added twist that the cost of substitution for common misspellings ('I' vs 'l', '0' vs 'O') or mistypings ('Q' vs 'W' etc.) would be lower.
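To make that concrete, here is a minimal sketch of the idea, not a reference implementation. It uses the restricted Damerau–Levenshtein variant (optimal string alignment), and the `CONFUSABLE` table, the `0.25` cost, and the function names are assumptions chosen for illustration. It also assumes you have already stripped the TLD (`.com`) from the domain.

```python
# Hypothetical table of visually confusable pairs; extend it for your data
# (keyboard neighbours such as q/w could be added in the same way).
CONFUSABLE = {frozenset(p) for p in [("0", "o"), ("1", "l"), ("i", "l"), ("1", "i")]}


def sub_cost(a: str, b: str, cheap: float = 0.25) -> float:
    """Substitution cost: free if equal, cheap if confusable, otherwise 1."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE:
        return cheap
    return 1.0


def weighted_osa(s: str, t: str) -> float:
    """Optimal string alignment distance with weighted substitutions."""
    s, t = s.lower(), t.lower()  # assumption: case differences are irrelevant
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)  # delete i characters
    for j in range(1, n + 1):
        d[0][j] = float(j)  # insert j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,  # deletion
                d[i][j - 1] + 1.0,  # insertion
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),  # substitution
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)
    return d[m][n]


print(weighted_osa("facebook", "facebo0k"))
print(weighted_osa("linkedIn", "linkedln"))
print(weighted_osa("stackoverflow", "stckoverflow"))
```

With these weights a look-alike substitution counts as a quarter of an ordinary edit, so look-alike spoofs rank closer to the keyword than ordinary one-character edits.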

maniek
2

The Smith-Waterman algorithm is probably better suited to your task, since it lets you define a score function that reflects what you consider to be a 'similarity' between characters (for instance, O is quite similar to 0).
That freedom is its main advantage: the vanilla versions of the other algorithms you mention do not necessarily offer it.

This algorithm is widely used in bioinformatics, where biologists try to detect DNA sequences that differ but have the same, or very similar, function (for instance, different codons that code for the same amino acid).

The algorithm runs in quadratic time using dynamic programming, and is fairly easy to implement.
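As a rough sketch of what that looks like (the scores `+2`, `+1`, `-1`, the gap penalty, and the `SIMILAR` table are assumptions picked for illustration, not canonical values):

```python
# Hypothetical set of character pairs treated as "similar".
SIMILAR = {frozenset(p) for p in [("0", "o"), ("1", "l"), ("i", "l")]}
GAP = -1  # assumed linear gap penalty


def score(a: str, b: str) -> int:
    """User-defined score function: this is the part you tune."""
    if a == b:
        return 2
    if frozenset((a, b)) in SIMILAR:
        return 1
    return -1


def smith_waterman(s: str, t: str) -> int:
    """Best local alignment score between s and t (quadratic time)."""
    s, t = s.lower(), t.lower()
    best = 0
    prev = [0] * (len(t) + 1)  # previous row of the DP table
    for i in range(1, len(s) + 1):
        cur = [0]
        for j in range(1, len(t) + 1):
            h = max(
                0,  # start a new local alignment
                prev[j - 1] + score(s[i - 1], t[j - 1]),  # match or mismatch
                prev[j] + GAP,  # gap in t
                cur[j - 1] + GAP,  # gap in s
            )
            cur.append(h)
            best = max(best, h)
        prev = cur
    return best


print(smith_waterman("facebook", "facebo0k"))
print(smith_waterman("facebook", "facebook"))
```

Since it is a local alignment, the raw score grows with the length of the matched region; dividing by the best possible score for the shorter string is one way to get a value you can threshold.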

m.raynal
  • Jaro-Winkler also provides a normalized score, so what's the difference from Jaro-Winkler then? – Exploring Sep 02 '20 at 05:18
  • Any of the above can be adapted to provide a normalized score. The difference between Smith-Waterman and Jaro-Winkler is that the former gives much more freedom in the choice of the score function and has fewer priors, while the latter takes transpositions into account. – m.raynal Sep 02 '20 at 10:49
1

If you are only choosing between Levenshtein and Jaro-Winkler distances, you will probably want Jaro-Winkler. It takes into account only matching characters and any required transpositions (swapped characters), and it is a value between zero and one: the distance equals 1 (no similarity) when there are no closely matching characters, which makes it easy to filter out obvious non-matches.

A Levenshtein distance, by contrast, gives a value for any pair of strings no matter how different they are, so you have to choose a cutoff threshold for what to consider a match.

However, Jaro-Winkler gives extra weight to prefix similarity (matching characters near the beginning of the strings). If this isn't desired, then the regular Jaro distance might be what you want.
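For reference, a minimal sketch of both measures, written as similarities (the distance described above is `1 - similarity`). It assumes the usual scaling factor `p = 0.1` and a prefix capped at 4 characters; the function names are my own.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical."""
    if s == t:
        return 1.0
    ls, lt = len(s), len(t)
    if ls == 0 or lt == 0:
        return 0.0
    # Characters match only if they are equal and within this window.
    window = max(max(ls, lt) // 2 - 1, 0)
    s_hit = [False] * ls
    t_hit = [False] * lt
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(lt, i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order, counted in pairs.
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / ls + matches / lt + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    sim = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


print(jaro("facebook", "facebo0k"), jaro_winkler("facebook", "facebo0k"))
print(1 - jaro_winkler("abc", "xyz"))  # distance for a pair with no matches
```

Comparing the two numbers on the first line shows how much of the score comes from the shared prefix, which is the effect to watch for if your variations can occur at the start of the keyword.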

Exploring
Arlo Clarke