how do I write a custom encoding in python to clean up my data?

Question

I know I've done this before at another job, but I can't remember what I did.

I have a database that is full of varchar and memo fields that were cut and pasted from Office, webpages, and who knows where else. This is starting to cause encoding errors for me. Since Python has a very nice "decode" function to take a byte stream and translate it into Unicode, I thought that would just write my own encoding to fix this up. (For example, to take "smart quotes" and turn them into "standard quotes".)

But I can't remember how to get started. I think I copied one of the encodings that was close (cp1252.py) and then updated it.

Can anyone put me on the right path? Or suggest a better path?

samplebias · Answer 1 · 2011-04-29T21:30:26.957

I've expanded this with a bit more detail.

If you are reasonably sure of the encoding of the text in the database, you can do text.decode('cp1252') to get a Unicode string. If the guess is wrong this will likely blow up with an exception, or the decoder will 'disappear' some characters.

Creating a decoder along the lines you describe (modifying cp1252.py) is easy. You just need to define the translation table from bytes to Unicode characters.

However if not all of the text in the database has the same encoding, your decoder will need some rules to decide which is the correct mapping. In this case you may want punt and use the chardet module, which can scan the text and make a guess the encoding.

Maybe the best approach would be try to decode using the most likely encoding (cp1252) and if that fails, fallback to using chardet to guess the correct encoding.

If you use text.decode() and/or chardet, you'll end up with a Unicode string. Below is a simple routine which can translate characters in a Unicode string, e.g. "convert curly quotes to ASCII":

CHARMAP = [
    (u'\u201c\u201d', '"'),
    (u'\u2018\u2019', "'")
    ]

# replace with text.decode('cp1252') or chardet
text = u'\u201cit\u2019s probably going to work\u201d, he said'

_map = dict((c, r) for chars, r in CHARMAP for c in list(chars))
fixed = ''.join(_map.get(c, c) for c in text)
print fixed

Output:

"it's probably going to work", he said

how do I write a custom encoding in python to clean up my data?

1 Answers1

Linked