
I need to deal with a corrupt database in which names are stored once with accents and once without the non-ASCII characters. In particular, I have the following two records:

record_1 = u'Tim Münster'  
record_2 = u'Tim Mnster'

Is there a possibility to find such duplicate records?

Bach
Andy
  • This seems trivial. You can simply remove non-ascii characters with `"".join([x for x in s if ord(x)<128])` (hacky, but works), and use set operations to check for duplicates. Where are you stuck? – loopbackbee Apr 17 '14 at 10:50
  • @goncalopp that looks suspiciously like an answer! Why not post it as one? – Tom Fenech Apr 17 '14 at 10:52
  • @TomFenech I guess it seemed too easy to be what he was looking for – loopbackbee Apr 17 '14 at 10:59
  • Thanks, I like this hacky solution. Will try it and eventually post an update if it doesn't work completely. – Andy Apr 17 '14 at 11:08

4 Answers


You can detect such duplicates by stripping the non-alphabetic characters before comparing, for example with re.sub(r'[^a-zA-Z ]', '', record_1).
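For instance, a small sketch of that comparison (the helper name `strip_non_alpha` is mine, not part of the answer):

```python
import re

def strip_non_alpha(s):
    # Drop everything except ASCII letters and spaces (hypothetical helper)
    return re.sub(r'[^a-zA-Z ]', '', s)

record_1 = u'Tim Münster'
record_2 = u'Tim Mnster'
print(strip_non_alpha(record_1) == record_2)  # True
```

Note that this also discards digits and punctuation, which may matter for other records.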

Roberto Reale

You can remove non-ASCII characters with

def remove_nonascii(s):
    return "".join(x for x in s if ord(x) < 128)

To check for duplicates, simply use sets:

records = set([u'Tim Mnster'])
duplicates_to_check = [u'Tim Münster']
for possible_duplicate in duplicates_to_check:
    if remove_nonascii(possible_duplicate) in records:
        print possible_duplicate, "is a duplicate"
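Scaled up to a whole table, the same ASCII-stripped string can serve as a grouping key, so every cluster of duplicate candidates is found in one pass (a sketch; `find_duplicate_groups` is my addition, not part of the answer):

```python
from collections import defaultdict

def remove_nonascii(s):
    return "".join(x for x in s if ord(x) < 128)

def find_duplicate_groups(records):
    # Records that collapse to the same ASCII-only key are duplicate candidates
    groups = defaultdict(list)
    for r in records:
        groups[remove_nonascii(r)].append(r)
    return [g for g in groups.values() if len(g) > 1]

print(find_duplicate_groups([u'Tim Münster', u'Tim Mnster', u'Alice']))
# [['Tim Münster', 'Tim Mnster']]
```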
loopbackbee
  • There's no need for the `[ ]` in your `join` – Tom Fenech Apr 17 '14 at 10:57
  • @TomFenech indeed, I'm still not used to using generator expressions on day-to-day situations. List comprehensions [might be slightly faster, though](https://stackoverflow.com/questions/11964130/list-comprehension-vs-generator-expressions-weird-timeit-results) – loopbackbee Apr 17 '14 at 11:05
  • @TomFenech Actually [`str.join` is both faster and memory-efficient](http://stackoverflow.com/a/9061024/846892) when used with a list. – Ashwini Chaudhary Apr 17 '14 at 11:21
  • @Aशwiniचhaudhary I never knew that. I might change my answer too then. Thanks! – Tom Fenech Apr 17 '14 at 11:23
  • @TomFenech If the strings are not very huge then the difference is not going to matter, and I do find `str.join` more readable when used with a list. – Ashwini Chaudhary Apr 17 '14 at 11:25

Finding duplicate records:

I would do something like this:

1) Remove the accented characters from the string:

import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nkfd_form.encode('ASCII', 'ignore')
    return only_ascii.decode('ASCII')  # decode so a text string comes back on Python 3
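Note that NFKD keeps the base letter: u'Tim Münster' comes out as 'Tim Munster', which is still one edit away from the corrupt 'Tim Mnster'. That is why step 2 compares with an edit-distance threshold instead of plain equality. A self-contained check (with an added `.decode()` so a text string comes back on Python 3):

```python
import unicodedata

def remove_accents(input_str):
    # NFKD splits 'ü' into 'u' + combining diaeresis; encode/ignore drops the mark
    nkfd_form = unicodedata.normalize('NFKD', input_str)
    return nkfd_form.encode('ASCII', 'ignore').decode('ASCII')

print(remove_accents(u'Tim Münster'))  # Tim Munster
```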

2) Check whether the accent-stripped record_1 matches record_2, using a fuzzy match based on Levenshtein (edit) distance:

from nltk import metrics, stem, tokenize

stemmer = stem.PorterStemmer()  # needed by normalize() below

def normalize(s):
    words = tokenize.wordpunct_tokenize(s.lower().strip())
    return ' '.join(stemmer.stem(w) for w in words)

def fuzzy_match(s1, s2, max_dist=2):
    return metrics.edit_distance(normalize(s1), normalize(s2)) <= max_dist
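If nltk is unavailable, `metrics.edit_distance` can be replaced by the classic dynamic-programming Levenshtein distance using the standard library alone (a sketch; `levenshtein` is my name for it, not an nltk function):

```python
def levenshtein(s1, s2):
    # Classic dynamic-programming edit distance, row by row
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1):
        current = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous[j + 1] + 1
            deletions = current[j] + 1
            substitutions = previous[j] + (c1 != c2)
            current.append(min(insertions, deletions, substitutions))
        previous = current
    return previous[-1]

print(levenshtein(u'tim munster', u'tim mnster'))  # 1
```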
Viren Rajput

You could use ascii_letters and whitespace from the string module:

import string
def ascii_or_space(s):
    return "".join(c for c in s if c in string.ascii_letters + string.whitespace)

>>> ascii_or_space(record_1) == record_2
True
Tom Fenech