I've expanded this with a bit more detail.
If you are reasonably sure of the encoding of the text in the database, you can call text.decode('cp1252') to get a Unicode string. If the guess is wrong, this will likely blow up with a UnicodeDecodeError, or the decoder will silently 'disappear' some characters.
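A minimal sketch of both failure modes (the raw bytes below are made up for illustration; substitute whatever you actually read from the database):

raw = 'caf\xe9 \x93quoted\x94'       # hypothetical bytes as stored in the database
text = raw.decode('cp1252')          # -> u'caf\xe9 \u201cquoted\u201d'

'\x81'.decode('cp1252')              # 0x81 is undefined in cp1252: UnicodeDecodeError
'\x81'.decode('cp1252', 'ignore')    # the byte silently 'disappears' -> u''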
Creating a decoder along the lines you describe (modifying cp1252.py) is easy. You just need to define the translation table from bytes to Unicode characters.
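Here is a rough sketch of that idea, starting from the standard cp1252 table and overriding individual byte values (the 0x9d mapping below is invented purely for illustration; a full custom codec would also need to be registered with the codecs module):

import codecs
from encodings import cp1252

# Copy the stock cp1252 decoding table, then override entries as needed
table = list(cp1252.decoding_table)
table[0x9d] = u'\u2022'  # hypothetical: map the normally-undefined 0x9d to a bullet

def decode_custom(data, errors='strict'):
    # charmap_decode returns (unicode_string, bytes_consumed)
    return codecs.charmap_decode(data, errors, u''.join(table))[0]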
However, if not all of the text in the database has the same encoding, your decoder will need some rules to decide which mapping is correct. In that case you may want to punt and use the chardet module, which can scan the text and make a guess at its encoding.
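Basic usage looks something like this (raw is again a stand-in for the undecoded bytes from the database):

import chardet  # third-party: pip install chardet

guess = chardet.detect(raw)
# e.g. {'encoding': 'windows-1252', 'confidence': 0.73}
text = raw.decode(guess['encoding'])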
Maybe the best approach would be to try decoding with the most likely encoding (cp1252) and, if that fails, fall back to chardet's guess, as sketched below.
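Something along these lines (smart_decode is just an illustrative name, and falling back to latin-1 when chardet has no guess is my own choice, since latin-1 can decode any byte):

import chardet

def smart_decode(raw, likely='cp1252'):
    # Try the most likely encoding first; strict mode raises on bad bytes
    try:
        return raw.decode(likely)
    except UnicodeDecodeError:
        guess = chardet.detect(raw)
        return raw.decode(guess['encoding'] or 'latin-1', 'replace')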
If you use text.decode() and/or chardet, you'll end up with a Unicode string. Below is a simple routine which can translate characters in a Unicode string, e.g. "convert curly quotes to ASCII":
# Each entry maps a group of Unicode characters to one ASCII replacement
CHARMAP = [
    (u'\u201c\u201d', '"'),   # left/right double quotes
    (u'\u2018\u2019', "'"),   # left/right single quotes
]

# replace with text.decode('cp1252') or chardet
text = u'\u201cit\u2019s probably going to work\u201d, he said'

# Flatten CHARMAP into a per-character lookup table
_map = dict((c, r) for chars, r in CHARMAP for c in chars)
fixed = u''.join(_map.get(c, c) for c in text)
print(fixed)
Output:
"it's probably going to work", he said