I found a similar topic, "Levenshtein distance on diacritic characters", but it's in PHP and I write in Python. Still, the problem remains the same. For instance, levenshtein(kot, kod) = 1, but levenshtein(się, sie) = 2, which is wrong (it should be 1). Any ideas on how to solve this?
-
Are you using Python 2.7? Do you input or cast strings to unicode (e.g. `u"się"`, or `unicode(raw_input())`)? – Kolmar Mar 24 '15 at 22:44
-
Python 2.7, coding utf-8 and sys.argv. Well, I just discovered that len(anything-with-national-characters) is longer than expected: len(się) = 4. Just why? :( – user4598392 Mar 24 '15 at 22:48
-
I also tried `word1 = unicode(sys.argv[1])`, and also `word1 = sys.argv[1]` followed by `word1 = unicode(word1)`, but then it stopped treating it as a word (it throws an exception saying that I need two arguments and only gave one) – user4598392 Mar 24 '15 at 22:56
1 Answer
First of all, you have to make sure that both strings are unicode. In Python 3 that happens automatically, but in Python 2 you have to decode the strings to the unicode type first, for example with `sys.argv[1].decode('utf-8')` if you know that the console encoding is UTF-8. You may try to guess this encoding with `sys.stdin.encoding`.
After that, you may have to normalize the unicode strings. For example, the unicode strings `u'\u00c7'` and `u'\u0043\u0327'` both render as Ç, but they compare as non-equal and have a non-zero Levenshtein distance. To normalize strings, you can use the `unicodedata.normalize` function.
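To make the normalization point concrete, here is a small sketch using the two strings mentioned above. The variable names are only illustrative, and the snippet is written so that it should run on Python 3 as well as Python 2.7.

```python
import unicodedata

precomposed = u'\u00c7'       # a single code point: C with cedilla
decomposed = u'\u0043\u0327'  # two code points: C followed by a combining cedilla

# Both render as the same glyph, but they are different sequences of code points
print(precomposed == decomposed)

# NFC composes the pair into the single precomposed code point
composed = unicodedata.normalize('NFC', decomposed)
print(composed == precomposed)
print(len(decomposed), len(composed))
```

After normalization both inputs are the same sequence of code points, so an edit distance computed on them is zero.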
The script in Python 2 might look something like this:
    import sys
    import unicodedata

    # import or define your levenshtein function here

    def decode_and_normalize(s):
        # Decode the UTF-8 bytes from the console, then normalize to a composed form
        return unicodedata.normalize('NFKC', s.decode('utf-8'))

    s1 = decode_and_normalize(sys.argv[1])
    s2 = decode_and_normalize(sys.argv[2])
    print levenshtein(s1, s2)
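The script above leaves `levenshtein` undefined. As an illustration only, here is a plain dynamic-programming version (not necessarily the one the asker uses) together with a check of why decoding matters: run on UTF-8 bytes, się and sie are two edits apart, while on decoded text they are one edit apart. The snippet is intended to run on Python 3 and also on Python 2.7.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current = [i]
        for j, item_b in enumerate(b, 1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

text_a = u'si\u0119'  # "się" written with an escape to avoid source-encoding issues
text_b = u'sie'

# On UTF-8 bytes, "ę" occupies two bytes, so the distance is inflated
bytes_distance = levenshtein(text_a.encode('utf-8'), text_b.encode('utf-8'))
# On decoded text, "ę" is a single character
text_distance = levenshtein(text_a, text_b)

print(bytes_distance, text_distance)
```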
And after all that, you may still run into problems if the characters are outside the Basic Multilingual Plane. On this issue, look at this Stack Overflow question.
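A rough way to see where the trouble comes from: a code point outside the Basic Multilingual Plane takes two UTF-16 code units, and "narrow" Python 2 builds expose those as two separate elements of a unicode string. The sketch below only inspects this; the chosen code point is an arbitrary example.

```python
import sys

s = u'\U0001F600'  # an arbitrary code point outside the BMP

# Number of UTF-16 code units needed to store the character
utf16_units = len(s.encode('utf-16-le')) // 2

# sys.maxunicode is 0xFFFF on narrow builds and 0x10FFFF otherwise
print(sys.maxunicode > 0xFFFF)
# len(s) is 1 on Python 3 and wide Python 2 builds, 2 on narrow Python 2 builds
print(len(s), utf16_units)
```

On a narrow build, a per-element edit distance would count such a character as two symbols.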
-
Thank you so much, it works! (I can't give upvotes due to small reputation). – user4598392 Mar 24 '15 at 23:33
-
Anyway, now I'm trying to slightly change the Levenshtein distance algorithm so that words with national characters and without them are treated as equal. For example: levenshtein(się, sie) = 0, levenshtein(jąkać, jakać) = 0 and so on. I created a list of national substitutions: [['ę', 'e'], ['ą', 'a'] ... ], but I'm getting an error: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal. How to fix this? – user4598392 Mar 24 '15 at 23:37
-
@user4598392 I think you can still accept the answer. To have a unicode literal in Python 2 put `u` before the string. So the substitutions will be `[[u'ę', u'e'], [u'ą', u'a'] ... ]`. – Kolmar Mar 25 '15 at 08:37
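A minimal sketch of the substitution idea from the comments above, assuming the goal is to fold national characters to plain letters before computing the distance. The table and the `fold` helper are hypothetical names, and the table covers only lowercase Polish letters as an example. All keys and values are unicode, which avoids the mixed byte string and unicode comparison behind the warning.

```python
# Illustrative table: lowercase Polish letters mapped to their plain counterparts
SUBSTITUTIONS = {
    u'\u0105': u'a',  # ą
    u'\u0107': u'c',  # ć
    u'\u0119': u'e',  # ę
    u'\u0142': u'l',  # ł
    u'\u0144': u'n',  # ń
    u'\u00f3': u'o',  # ó
    u'\u015b': u's',  # ś
    u'\u017a': u'z',  # ź
    u'\u017c': u'z',  # ż
}

# translate() on unicode strings expects a mapping keyed by code point
TABLE = {ord(src): dst for src, dst in SUBSTITUTIONS.items()}

def fold(text):
    """Replace each listed national character with its plain letter."""
    return text.translate(TABLE)

print(fold(u'si\u0119') == fold(u'sie'))
print(fold(u'j\u0105ka\u0107') == fold(u'jaka\u0107'))
```

Passing `fold(s1)` and `fold(s2)` to the distance function then yields 0 for pairs that differ only in the listed characters.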
-
I want to write thanks (after so long time) ;) And please, close the topic or sth, because problem is solved. – user4598392 Mar 31 '15 at 22:14