Levenshtein distance in Python - wrong result with national characters

Question

I found a similar topic, "Levenshtein distance on diacritic characters", but it is about PHP and I am writing in Python. The problem is the same, though. For instance, levenshtein(kot, kod) = 1, but levenshtein(się, sie) = 2, which is wrong (it should be 1). Any ideas on how to solve this?

Are you using Python 2.7? Do you input or cast strings to unicode (e.g. `u"się"`, or `unicode(raw_input())`)? – Kolmar, Mar 24 '15 at 22:44
Python 2.7, with a UTF-8 coding declaration, and the input comes from sys.argv. I just discovered that len() of anything with national characters is longer than expected: len(się) = 4. Why? :( – user4598392, Mar 24 '15 at 22:48
I also tried word1 = unicode(sys.argv[1]), as well as word1 = sys.argv[1] followed by word1 = unicode(word1), and it stopped counting it as a word (it throws an exception that I need two arguments and only gave one). – user4598392, Mar 24 '15 at 22:56

score 0 · Answer 1 · edited May 23 '17 at 11:56

First of all, make sure that both strings are unicode. In Python 3 that happens automatically, but in Python 2 you have to decode the byte strings to the unicode type first, for example with sys.argv[1].decode('utf-8') if you know that the console encoding is UTF-8. You can try to detect that encoding with sys.stdin.encoding.
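
To illustrate why len(się) came out as 4, here is a minimal sketch. It is written for Python 3 (an assumption, since the question is about Python 2), where a bytes object plays the role of the Python 2 str that sys.argv delivers. The variable names are only for illustration.

```python
# "się" written with an escape so the code point is unambiguous (ę = U+0119).
text = "si\u0119"

# In Python 2, sys.argv on a UTF-8 console hands you these raw bytes.
raw = text.encode("utf-8")

print(len(raw))                   # 4: "ę" takes two bytes in UTF-8
print(len(raw.decode("utf-8")))   # 3: one item per character after decoding
```

Any distance function that walks over the raw bytes sees the two bytes of "ę" as two separate symbols, which is where the extra edit comes from.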

After that you may have to normalize the strings. For example, the unicode strings u'\u00c7' and u'\u0043\u0327' both render as Ç, but they compare as non-equal and have a non-zero Levenshtein distance. To normalize strings, use the unicodedata.normalize function.
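
A short sketch of that comparison, again assuming Python 3 (in Python 2 the literals would need the u prefix). It uses the NFC form to show the effect. NFKC, used in the script further down, composes in the same way and additionally folds compatibility characters such as ligatures.

```python
import unicodedata

composed = "\u00c7"          # Ç as a single code point
decomposed = "\u0043\u0327"  # C followed by COMBINING CEDILLA

print(composed == decomposed)     # False: different code point sequences

# After normalization both spellings become the same string.
normalized = unicodedata.normalize("NFC", decomposed)
print(composed == normalized)     # True
print(len(decomposed), len(normalized))  # 2 1
```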

The script in Python 2 might look something like this:

import unicodedata
import sys
# import or define your levenshtein function here

def decode_and_normalize(s):
    return unicodedata.normalize('NFKC', s.decode('utf-8'))

s1 = decode_and_normalize(sys.argv[1])
s2 = decode_and_normalize(sys.argv[2])
print levenshtein(s1, s2)
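
The script leaves the levenshtein function to you. If you do not have one yet, the following is a minimal sketch of the standard dynamic-programming version, written for Python 3 as an assumption. It is not necessarily the implementation used in the question.

```python
def levenshtein(a, b):
    """Edit distance between two sequences, keeping only two rows of the table."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, item_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kot", "kod"))                          # 1
print(levenshtein("si\u0119", "sie"))                     # 1 on decoded text
print(levenshtein("si\u0119".encode("utf-8"), b"sie"))    # 2 on raw UTF-8 bytes
```

The last call reproduces the result from the question: on undecoded bytes, "ę" counts as two symbols.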

Even after all that, you may still run into problems if the strings contain characters outside the Basic Multilingual Plane. On this issue, see this Stack Overflow question.
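
One likely cause, stated here as an assumption because the linked question is not reproduced: "narrow" Python 2 builds store such characters as two UTF-16 surrogates, so len() and any per-item distance count them twice. A Python 3 sketch that makes the two counts visible:

```python
ch = "\U0001F600"  # a code point outside the Basic Multilingual Plane

# Number of UTF-16 code units, which is what a narrow Python 2 build iterates over.
utf16_units = len(ch.encode("utf-16-le")) // 2

print(len(ch))       # 1 on Python 3.3 and later: one code point
print(utf16_units)   # 2: a surrogate pair
```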

answered Mar 24 '15 at 23:22

Kolmar

14,086
1
22
25

Thank you so much, it works! (I can't give upvotes due to small reputation). – user4598392 Mar 24 '15 at 23:33
Anyway, now I'm trying to slightly change the Levenshtein distance algorithm so that words with national characters and without them are treated as equal. For example: levenshtein(się, sie) = 0, levenshtein(jąkać, jakać) = 0 and so on. I created a list of national substitutions: [['ę', 'e'], ['ą', 'a'] ... ], but I am getting an error: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal. How to fix this? – user4598392 Mar 24 '15 at 23:37
@user4598392 I think you can still accept the answer. To have a unicode literal in Python 2 put `u` before the string. So the substitutions will be `[[u'ę', u'e'], [u'ą', u'a'] ... ]`. – Kolmar Mar 25 '15 at 08:37
I want to write thanks (after so long time) ;) And please, close the topic or sth, because problem is solved. – user4598392 Mar 31 '15 at 22:14
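
As an alternative to a hand-written substitution list, the diacritic-insensitive comparison discussed in the comments above can be sketched by decomposing each string and dropping the combining marks before computing the distance. This is a Python 3 sketch under stated assumptions: strip_diacritics and EXTRA are hypothetical names, and the explicit table is needed because letters such as ł have no Unicode decomposition.

```python
import unicodedata

# Letters that do not decompose into base letter + combining mark.
EXTRA = {"\u0142": "l", "\u0141": "L"}  # ł, Ł

def strip_diacritics(text):
    """Return text with combining marks removed and EXTRA substitutions applied."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(EXTRA.get(c, c) for c in without_marks)

print(strip_diacritics("si\u0119"))          # sie   (from się)
print(strip_diacritics("j\u0105ka\u0107"))   # jakac (from jąkać)
print(strip_diacritics("\u0142\u00f3d\u017a"))  # lodz (from łódź)
```

Applying strip_diacritics to both words before calling the distance function makes się and sie, or jąkać and jakać, compare as identical.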
