
I have implemented a Levenshtein distance algorithm using a trie, as described by Steve Hanov. However, I'm having difficulty handling special characters. For instance, if I calculate the distance between Großmann and Grossmann, I need the distance to be zero, since ß and ss should be considered equal.
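For reference, a plain dynamic-programming Levenshtein sketch (Python here just for illustration; the question itself targets .NET) shows why ß versus ss scores 2 rather than the desired 0:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic edit distance: insert, delete, and substitute each cost 1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Großmann", "Grossmann"))  # 2: ß -> s (substitute) plus one extra s (insert)
```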

What would be the best solution (if any) to support these special cases?

My initial thought was to normalize all strings before calculating the distance, so Großmann -> Grossmann, österreich -> oesterreich, and so on. However, there seems to be no such functionality built into .NET?
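To illustrate the idea, here is a minimal normalization sketch; Python is used only for brevity, and the culture names and mapping tables are hypothetical examples, not a complete or official list:

```python
# Hypothetical per-culture transliteration tables; a real solution would
# need carefully curated tables for every culture of interest.
CULTURE_MAPS = {
    "de-DE": {"ß": "ss", "ä": "ae", "ö": "oe", "ü": "ue"},
    "nl-NL": {},  # Dutch: ö is not expanded to oe
}

def normalize(word: str, culture: str) -> str:
    # Replace each character via the culture's table, leaving others intact.
    mapping = CULTURE_MAPS.get(culture, {})
    return "".join(mapping.get(ch, ch) for ch in word)

print(normalize("Großmann", "de-DE"))    # Grossmann
print(normalize("österreich", "de-DE"))  # oesterreich
print(normalize("römer", "nl-NL"))       # römer (unchanged)
```

With both inputs normalized for the same culture, the existing Levenshtein code can be reused unchanged.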

M.Babcock
Kevin Meiresonne

2 Answers


The challenge is that the current culture does not identify the language of the individual words.

Assume you are willing to err on the side of a match.

1. Identify a set of characters that never need to be mapped (e.g. a-z).
2. Identify a single mapping that applies to all cultures.
3. Identify mappings for specific cultures.
4. First compute an unmapped Levenshtein distance.
5. If the unmapped distance is zero, stop.
6. If the unmapped distance is greater than x (e.g. 4), stop; it is not a match.
7. If the word contains only characters that never need to be mapped, stop.
8. Apply the all-cultures mapping to both words and, if the distance is zero, stop.
9. Apply the default culture's mapping and, if the distance is zero, stop.
10. Apply the other cultures' mappings and, if the distance is zero, stop.

I also added a straight string.Compare ahead of the Levenshtein calculation, to report 0 immediately when the strings are already equal.
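The staged comparison described above might be sketched like this (Python for brevity; the mapping tables, culture names, and the x = 4 cutoff are illustrative assumptions, not part of the original answer):

```python
import re

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Illustrative tables only; real ones would need careful per-culture curation.
COMMON_MAP = {"ß": "ss"}                               # applies in every culture
CULTURE_MAPS = {"de-DE": {"ö": "oe", "ä": "ae", "ü": "ue"}}

def apply_map(word: str, mapping: dict) -> str:
    return "".join(mapping.get(ch, ch) for ch in word)

def distance(a: str, b: str, default_culture: str = "de-DE",
             max_distance: int = 4) -> int:
    d = levenshtein(a, b)                  # unmapped distance first
    if d == 0 or d > max_distance:         # exact match, or too far apart to bother
        return d
    if re.fullmatch(r"[a-zA-Z]+", a + b):  # nothing left to map
        return d
    a2, b2 = apply_map(a, COMMON_MAP), apply_map(b, COMMON_MAP)
    if levenshtein(a2, b2) == 0:           # all-cultures mapping
        return 0
    m = CULTURE_MAPS.get(default_culture, {})
    if levenshtein(apply_map(a2, m), apply_map(b2, m)) == 0:  # default culture
        return 0
    for m in CULTURE_MAPS.values():        # remaining cultures
        if levenshtein(apply_map(a2, m), apply_map(b2, m)) == 0:
            return 0
    return d

print(distance("Großmann", "Grossmann"))  # 0, via the common ß -> ss mapping
```

The early exits matter: most comparisons involve plain a-z words and never touch the mapping tables at all.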

paparazzo

I think normalization is the way to go.

I'm not aware of any library that does this off-the-shelf, and a quick search didn't turn up anything.

A similar issue is discussed here: Converting "Bizarre" Chars in String to Roman Chars.

Their solution, manually creating a mapping, will work, as long as you can comprehensively identify all the necessary mappings in advance.

Daniel Renshaw
  • Yes, but that would be very cumbersome. For instance, when comparing German, it should replace ö with oe; however, it should not apply that normalization when comparing in, for instance, an English culture. – Kevin Meiresonne May 23 '12 at 13:17
  • @KevinMeiresonne I don't see why that should be the case but, if necessary, build a separate mapping for each culture. – Daniel Renshaw May 23 '12 at 13:19
  • If the culture is English you don't want to convert ö to oe? That does not make sense to me. – paparazzo May 23 '12 at 13:31
  • Let me give a more accurate example: in German, österreich is the same as oesterreich (so Levenshtein distance = 0); in Dutch, römer is not the same as roemer (so Levenshtein distance != 0). – Kevin Meiresonne May 23 '12 at 13:41
  • @KevinMeiresonne I see; thanks for the example. In that case I think a per-culture mapping is the way to go although I acknowledge that creating those mappings could be quite a task to take on. Maybe you could release it as open source if you do it, seeing as it seems nobody else has beat you to it? – Daniel Renshaw May 23 '12 at 13:56
  • Understood. The challenge is that the current culture does not identify the language of origin of the individual words. Now I get what you mean by normalize. That would not be easy. – paparazzo May 23 '12 at 13:58