2

Is there an edit distance in Python that takes accents into account? For example, one that holds the following property:

d('ab', 'ac') > d('àb', 'ab') > 0
Jonas Adler
vigte
  • Wouldn't just replacing accented letters with non-accented ones in both strings, then calculating the distance, work? – Dogbert Apr 03 '13 at 14:48
  • I second that. Using Unidecode might help: https://pypi.python.org/pypi/Unidecode/0.04.1 – nicbou Apr 03 '13 at 14:49
  • ok thanks, but at this point I have d('àa','aa') = 0. – vigte Apr 03 '13 at 14:52
  • Where is the problem? You don't know how to tell if a given character is an "accented" version of another, or how to integrate this fact into the distance itself? (Or both?) – Bakuriu Apr 03 '13 at 14:56
  • @vigte, what values do you want then? – Dogbert Apr 03 '13 at 15:08
  • I think I'm able to do it manually, but I'm just asking whether there is an optimized version already available that takes accent distance into account. – vigte Apr 03 '13 at 15:08
  • something like d('ab', 'ac') > d('àb', 'ab') > 0 – vigte Apr 03 '13 at 15:12
  • One of the main points of the distance should be to give a value strictly greater than 0 for d('àa','aa'). – vigte Apr 03 '13 at 15:15

3 Answers

6

With the Levenshtein module:

In [1]: import unicodedata, string

In [2]: from Levenshtein import distance

In [3]: def remove_accents(data):
   ...:     return ''.join(x for x in unicodedata.normalize('NFKD', data)
   ...:                             if x in string.ascii_letters).lower()

In [4]: def norm_dist(s1, s2):
   ...:     norm1, norm2 = remove_accents(s1), remove_accents(s2)
   ...:     d1, d2 = distance(s1, s2), distance(norm1, norm2)
   ...:     return (d1+d2)/2.

In [5]: norm_dist(u'ab', u'ac')
Out[5]: 1.0

In [6]: norm_dist(u'àb', u'ab')
Out[6]: 0.5
root
3

Unicode allows decomposition of accented characters into the base character plus a combining accent character; e.g. à decomposes into a followed by a combining grave accent.

You want to convert both strings using normalization form NFKD, which decomposes accented characters and converts compatibility characters to their canonical forms, then use an edit distance metric that ranks substitutions above insertions and deletions.
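The approach above can be sketched as a weighted Levenshtein distance over NFKD-decomposed strings, where a substitution costs more than an insertion or deletion. This is a minimal illustration, not code from the answer; the `weighted_distance` name and the specific cost values are assumptions:

```python
import unicodedata

def weighted_distance(s1, s2, sub_cost=2, indel_cost=1):
    # Decompose accented characters, so 'à' becomes 'a' + combining grave accent.
    # An accent difference then shows up as an insertion/deletion of the
    # combining character (cheap), while a truly different letter is a
    # substitution (expensive).
    s1 = unicodedata.normalize('NFKD', s1)
    s2 = unicodedata.normalize('NFKD', s2)

    # Standard dynamic-programming Levenshtein with custom operation costs,
    # keeping only the previous row to save memory.
    prev = [j * indel_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, 1):
        curr = [i * indel_cost]
        for j, c2 in enumerate(s2, 1):
            if c1 == c2:
                curr.append(prev[j - 1])
            else:
                curr.append(min(prev[j - 1] + sub_cost,    # substitution
                                prev[j] + indel_cost,      # deletion
                                curr[-1] + indel_cost))    # insertion
        prev = curr
    return prev[-1]

print(weighted_distance('ab', 'ac'))  # 2: real letter substituted
print(weighted_distance('àb', 'ab'))  # 1: only a combining accent differs
```

With these costs the requested ordering holds: d('ab', 'ac') = 2 > d('àb', 'ab') = 1 > 0.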

ecatmur
1

Here's a solution based on difflib and unicodedata with no dependencies whatsoever:

import unicodedata
from difflib import Differ

# function taken from https://stackoverflow.com/a/517974/1222951
def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore').decode()
    return only_ascii

def compare(wrong, right):
    # normalize both strings to make sure equivalent (but
    # different) unicode characters are canonicalized 
    wrong = unicodedata.normalize('NFKC', wrong)
    right = unicodedata.normalize('NFKC', right)

    num_diffs = 0
    index = 0
    differences = list(Differ().compare(wrong, right))
    while True:
        try:
            diff = differences[index]
        except IndexError:
            break

        # diff is a string like "+ a" (meaning the character "a" was inserted)
        # extract the operation and the character
        op = diff[0]
        char = diff[-1]

        # if the character isn't equal in both
        # strings, increase the difference counter
        if op != ' ':
            num_diffs += 1

        # if a character is wrong, there will be two operations: one
        # "+" and one "-" operation
        # we want to count this as a single mistake, not as two mistakes
        if op in '+-':
            try:
                next_diff = differences[index+1]
            except IndexError:
                pass
            else:
                next_op = next_diff[0]
                if next_op in '+-' and next_op != op:
                    # skip the next operation, we don't want to count
                    # it as another mistake
                    index += 1

                    # we know that the character is wrong, but
                    # how wrong is it?
                    # if the only difference is the accent, it's
                    # a minor mistake
                    next_char = next_diff[-1]
                    if remove_accents(char) == remove_accents(next_char):
                        num_diffs -= 0.5

        index += 1

    # output the difference as a ratio of
    # (# of wrong characters) / (length of longest input string)
    return num_diffs / max(len(wrong), len(right))

Tests:

for w, r in (('ab','ac'),
            ('àb','ab'),
            ('être','etre'),
            ('très','trés'),
            ):
    print('"{}" and "{}": {}% difference'.format(w, r, compare(w, r)*100))
"ab" and "ac": 50.0% difference
"àb" and "ab": 25.0% difference
"être" and "etre": 12.5% difference
"très" and "trés": 12.5% difference
Aran-Fey