1

I need a function that would compare two strings and outputs an edit distance like Levenshtein, but only if the characters are homoglyphs in cursives. I have a list of those homoglyphs so I could feed a custom list to this function.

Example

homoglyphs = [["o","a"],["rn","m","nn"],...] // In cursive they look-alike

compare("Mory", "Mary", homoglyphs) // Levenshtein gives 1
compare("Mory", "Tory", homoglyphs) // Levenshtein gives 1, but I want false, 99 or -1

compare("Morio", "Mario", homoglyphs) // I expect a distance of 1
compare("Morio", "Maria", homoglyphs) // I expect a distance of 2

Tory should give a false result since there's no way someone misread an M as a T. An A could be misread as an O so it can count as 1.

The scoring could be different, I just need to know that Mory is probably Mary not Tory and Morio is a little more likely to be Mario than Maria.

Do something like this exists?

James McGrath
  • 225
  • 3
  • 11

1 Answers1

1

The key to your problem can be thought of like an IQ word association question.

  Sound       Glyph
--------- =  ----------
Homophone    Homoglyphs

Now if you know that there is a way to find similar sounding words (homophone) then the same can be applied but instead of sounds change to glyphs (homoglyph).

The way to find similar sounding words is via Soundex (Sound Index).

So just do what Soundex does but instead of having a mapping from similar homophones use similar homoglyphs.

Once you convert each word (glyphs) input into a Glyphdex (Glyph Index) then you can compute the Levenshtein distance for the two Glyphdex.

Make sense?


If you are into cellular biology then codon translation into amino acids (ref) might make more sense. Many amino acids are coded by more than one 3 letter codon.


Note: Since the word glyhdex has been used prior to me writing this I can not say I coined that word, however the usage I currently find via Google (search) for the word are not in the same context as described here. So in the context of converting a sequence of glyphs into an index of similar sequence of glyphs I will take credit.

Guy Coder
  • 24,501
  • 8
  • 71
  • 136
  • I hoped that some library somewhere had this function already, but I understand it's quite a rare use case. I was going to code a custom Levenshtein, but your Glyphdex idea would probably be more performant, I'll try that. Thanks! – James McGrath Jun 27 '22 at 13:34
  • For preexisting code in a library for this I would search spell checkers. They would certainly need something similar. Since you noted `Levenshtein` I didn't look as that is very specific algorithm and it might not be used in such libraries. – Guy Coder Jun 27 '22 at 21:17
  • I tried using a few spell checkers and they don't handle homoglyphs very well. They work more for typing errors than an "ln" that looked like a "h" in cursive. Even worst with names. – James McGrath Jun 28 '22 at 12:30
  • @JamesMcGrath Thanks for the feedback, was worth a try. Another idea now that I know you also seek `In` that looks like `h` would be Optical Character Recognition (OCR). That sort of problem comes up often when trying to do OCR on an old photo image copy that has speckles on it or is is tilted out of alignment. However that would not be an algorithm as much as a neural network. – Guy Coder Jun 28 '22 at 13:14
  • @JamesMcGrath Another related problem is with PDFs containing [ligatures](https://en.wikipedia.org/wiki/Ligature_(writing)) do to incorrect OCR. I can not tell you how many times I will see a word search fail because characters were converted to a ligature and the search will only match exactly. – Guy Coder Jun 28 '22 at 13:20