
I need to deal with a corrupt database in which names are stored once with accents and once without the non-ASCII characters. In particular, I have the following two records:

record_1 = u'Tim Münster'  
record_2 = u'Tim Mnster'

Is there a possibility to find such duplicate records?

Bach
Andy
  • This seems trivial. You can simply remove non-ascii characters with `"".join([x for x in s if ord(x)<128])` (hacky, but works), and use set operations to check for duplicates. Where are you stuck? – loopbackbee Apr 17 '14 at 10:50
  • @goncalopp that looks suspiciously like an answer! Why not post it as one? – Tom Fenech Apr 17 '14 at 10:52
  • @TomFenech I guess it seemed too easy to be what he was looking for – loopbackbee Apr 17 '14 at 10:59
  • Thanks, I like this hacky solution. Will try it and eventually post an update if it doesn't work completely. – Andy Apr 17 '14 at 11:08

4 Answers


You can detect such duplicates by stripping the non-alphabetic characters before comparing, for example with re.sub(r'[^a-zA-Z ]', '', record_1).
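For instance, a small sketch of that comparison (the helper name `strip_non_alpha` is mine, not part of the answer):

```python
import re

def strip_non_alpha(s):
    # Drop everything except ASCII letters and spaces (hypothetical helper)
    return re.sub(r'[^a-zA-Z ]', '', s)

record_1 = u'Tim Münster'
record_2 = u'Tim Mnster'
print(strip_non_alpha(record_1) == record_2)  # True
```

Note that this also discards digits and punctuation, which may matter for other records.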

Roberto Reale

You can remove non-ASCII characters with

def remove_nonascii(s):
    return "".join(x for x in s if ord(x) < 128)

To check for duplicates, simply use sets:

records = set([u'Tim Mnster'])
duplicates_to_check = [u'Tim Münster']
for possible_duplicate in duplicates_to_check:
    if remove_nonascii(possible_duplicate) in records:
        print possible_duplicate, "is a duplicate"
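Scaled up to a whole table, the same ASCII-stripped string can serve as a grouping key, so every cluster of duplicate candidates is found in one pass (a sketch; `find_duplicate_groups` is my addition, not part of the answer):

```python
from collections import defaultdict

def remove_nonascii(s):
    return "".join(x for x in s if ord(x) < 128)

def find_duplicate_groups(records):
    # Records that collapse to the same ASCII-only key are duplicate candidates
    groups = defaultdict(list)
    for r in records:
        groups[remove_nonascii(r)].append(r)
    return [g for g in groups.values() if len(g) > 1]

print(find_duplicate_groups([u'Tim Münster', u'Tim Mnster', u'Alice']))
# [['Tim Münster', 'Tim Mnster']]
```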
loopbackbee
  • There's no need for the `[ ]` in your `join` – Tom Fenech Apr 17 '14 at 10:57
  • @TomFenech indeed, I'm still not used to using generator expressions on day-to-day situations. List comprehensions [might be slightly faster, though](https://stackoverflow.com/questions/11964130/list-comprehension-vs-generator-expressions-weird-timeit-results) – loopbackbee Apr 17 '14 at 11:05
  • @TomFenech Actually [`str.join` is both faster and memory-efficient](http://stackoverflow.com/a/9061024/846892) when used with a list. – Ashwini Chaudhary Apr 17 '14 at 11:21
  • @Aशwiniचhaudhary I never knew that. I might change my answer too then. Thanks! – Tom Fenech Apr 17 '14 at 11:23
  • @TomFenech If the strings are not very huge then the difference is not going to matter, and I do find `str.join` more readable when used with a list. – Ashwini Chaudhary Apr 17 '14 at 11:25

Finding duplicate records:

I would do something like this:

1) Remove the accented characters from the string:

import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nkfd_form.encode('ASCII', 'ignore')
    return only_ascii.decode('ASCII')  # decode so a text string comes back on Python 3
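Note that NFKD keeps the base letter: u'Tim Münster' comes out as 'Tim Munster', which is still one edit away from the corrupt 'Tim Mnster'. That is why step 2 compares with an edit-distance threshold instead of plain equality. A self-contained check (with an added `.decode()` so a text string comes back on Python 3):

```python
import unicodedata

def remove_accents(input_str):
    # NFKD splits 'ü' into 'u' + combining diaeresis; encode/ignore drops the mark
    nkfd_form = unicodedata.normalize('NFKD', input_str)
    return nkfd_form.encode('ASCII', 'ignore').decode('ASCII')

print(remove_accents(u'Tim Münster'))  # Tim Munster
```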

2) Check whether the accent-stripped record_1 matches record_2, using a fuzzy match based on Levenshtein (edit) distance:

from nltk import metrics, stem, tokenize

stemmer = stem.PorterStemmer()  # needed by normalize() below

def normalize(s):
    words = tokenize.wordpunct_tokenize(s.lower().strip())
    return ' '.join(stemmer.stem(w) for w in words)

def fuzzy_match(s1, s2, max_dist=2):
    return metrics.edit_distance(normalize(s1), normalize(s2)) <= max_dist
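If nltk is unavailable, `metrics.edit_distance` can be replaced by the classic dynamic-programming Levenshtein distance using the standard library alone (a sketch; `levenshtein` is my name for it, not an nltk function):

```python
def levenshtein(s1, s2):
    # Classic dynamic-programming edit distance, row by row
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous[j + 1] + 1
            deletions = current[j] + 1
            substitutions = previous[j] + (c1 != c2)
            current.append(min(insertions, deletions, substitutions))
        previous = current
    return previous[-1]

print(levenshtein(u'tim munster', u'tim mnster'))  # 1
```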
Viren Rajput

You could use ascii_letters and whitespace from the string module:

import string
def ascii_or_space(s):
    return "".join(c for c in s if c in string.ascii_letters + string.whitespace)

>>> ascii_or_space(record_1) == record_2
True
Tom Fenech