E.g. the Soundex algorithm is optimized for English. Is there a more universal algorithm that would apply across large families of languages?

Dori
torial

1 Answer

SOUNDEX is indeed English-oriented. Two alternatives that account for a wider variety of phonetic differences are Double Metaphone and NYSIIS.

Both encode into a much larger space than SOUNDEX, whose codes are limited to one letter followed by three digits. Double Metaphone in particular includes rules written expressly to handle alternate pronunciations from languages other than English.
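To make the "small code space" point concrete, here is a minimal sketch of classic American Soundex. It assumes ASCII input (anything outside A-Z is simply dropped, which is itself a sign of the English bias), and the names `CODES` and `soundex` are only illustrative, not taken from any particular library.

```python
# Consonant groups used by classic American Soundex.
# Vowels, Y, H and W have no digit of their own.
CODES = {}
for letters, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code (one letter + three digits)."""
    # Assumption: keep only ASCII letters; accented characters are dropped.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]  # pad with zeros, truncate to four characters


print(soundex("Robert"), soundex("Rupert"))        # same code for both
print(soundex("Catherine"), soundex("Katherine"))  # differ only in the first letter
```

Because the first letter is kept verbatim, spellings such as Catherine and Katherine never collide, which is one of the cases the wider-coverage algorithms try to handle.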

I recently gave a presentation on fuzzy string matching; the slides may be helpful.

Kyle Burton
  • The link to your slides is broken (404) – John Machin Sep 26 '09 at 05:29
  • @John: new link seems to be http://asymmetrical-view.com/talks/#fuzzy-string-matching – Hace Mar 04 '11 at 08:54
  • Thanks, I just updated it to point to the PDF in the related GitHub repo. I hope that stays more stable. – Kyle Burton Mar 09 '11 at 22:46
  • On slide 38, you show similarity percentages above 50%. I'm not saying that's wrong, but what formula are you using to calculate the similarity percentage from the edit distance? The formula I've seen, `1 / (1 + dist)`, maxes out at 50% for inexact matches. I know your costs are variable, but `1 / 1.4 != 93%`, which is the number you show on your slide. Thanks! – Jason Kleban Mar 25 '11 at 11:56
  • I may not have the same version you do; for me, slide 38 is an edit distance grid :( Which words are being compared on the slide you're looking at? The similarity formula I usually use is `(max(len(a), len(b)) - num_edits) / max(len(a), len(b))`. If you're looking at the Text Brew algorithm, it allows different costs for the various edits, and I'm pretty sure I used the same formula there. There is sample code in the GitHub repo. If you can tell me what's on the slide in question, I can probably answer better, or email me and we'll figure it out. – Kyle Burton Mar 26 '11 at 22:38
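
The two normalizations mentioned in the comments can be compared side by side with a short sketch. It assumes unit edit costs (plain Levenshtein distance, not Text Brew's variable costs), and the function names `levenshtein`, `similarity_inverse`, and `similarity_by_length` are made up for illustration; this is not the code from the slides or the repo.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity_inverse(a: str, b: str) -> float:
    """1 / (1 + dist): at most 0.5 for any inexact match with unit costs."""
    return 1 / (1 + levenshtein(a, b))


def similarity_by_length(a: str, b: str) -> float:
    """(max_len - num_edits) / max_len: scales with the longer string."""
    max_len = max(len(a), len(b))
    if max_len == 0:
        return 1.0  # assumption: two empty strings count as identical
    return (max_len - levenshtein(a, b)) / max_len


for a, b in [("kitten", "sitting"), ("cat", "cut")]:
    print(a, b, levenshtein(a, b),
          round(similarity_inverse(a, b), 3),
          round(similarity_by_length(a, b), 3))
```

With unit costs, the length-normalized version can exceed 50% for near matches on long strings, while `1 / (1 + dist)` cannot, which is consistent with the discrepancy raised above.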