How do I convert unicode string with cp1252 characters into UTF-8 with Python?

Question

I am getting text through an API that returns characters with a Windows-encoded apostrophe (\x92):

> python
>>> title = u'There\x92s thirty days in June'
>>> title
u'There\x92s thirty days in June'
>>> print title
Theres thirty days in June
>>> type(title)
<type 'unicode'>

I'm trying to convert this string to UTF-8 so that it instead returns: "There’s thirty days in June"

When I try to decode or encode this unicode string, it throws an error:

>>> title.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 5: ordinal not in range(128)

>>> title.encode("cp1252").decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x92' in position 5: character maps to <undefined>

If I were to initialize the string as plain-text and then decode it, it works:

>>> title = 'There\x92s thirty days in June'
>>> type(title)
<type 'str'>
>>> print title.decode('cp1252')
There’s thirty days in June

My question is: how do I convert the unicode string that I'm getting into a plain byte string (str) so that I can decode it?

`u'\x92'` is a C1 control character (U+0092, PRIVATE USE TWO) in a Unicode string. `'\x92'` is a `RIGHT SINGLE QUOTATION MARK` in a cp1252-encoded byte string. If you have the former, your API is decoding the string to Unicode incorrectly; decoded correctly, it would be `u'\u2019'`. — Mark Tolonen, Jul 25 '17 at 15:50

score 10 · Accepted Answer · answered Jul 25 '17 at 01:48

It seems your string was incorrectly decoded as latin1 (it is of type unicode, so some decoding has already happened).
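
To see why a latin1 decode is the likely culprit, here is a small check. It is only a sketch: the `raw` bytes are an assumed stand-in for what the API's source actually sent, and the code is written to run on both Python 2.7 and Python 3.

```python
import unicodedata

# Assumed stand-in for the original cp1252-encoded bytes behind the API.
raw = b'There\x92s'

wrong = raw.decode('latin1')  # latin1 maps byte 0x92 straight to U+0092
right = raw.decode('cp1252')  # cp1252 maps byte 0x92 to U+2019

print(repr(wrong))
print(repr(right))
print(unicodedata.category(wrong[5]))  # 'Cc' means a control character
print(unicodedata.name(right[5]))      # name of the properly decoded character
```

Because latin1 maps every byte 0x00-0xFF to the code point with the same number, encoding the bad string back with latin1 recovers the original bytes exactly.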

1. To convert it back to the bytes it originally was, you need to encode using that encoding (latin1).
2. Then, to get text back (unicode), you must decode using the proper codec (cp1252).
3. Finally, if you want to get to UTF-8 bytes, you must encode using the UTF-8 codec.

In code:

>>> title = u'There\x92s thirty days in June'
>>> title.encode('latin1')
'There\x92s thirty days in June'
>>> title.encode('latin1').decode('cp1252')
u'There\u2019s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252'))
There’s thirty days in June
>>> title.encode('latin1').decode('cp1252').encode('UTF-8')
'There\xe2\x80\x99s thirty days in June'
>>> print(title.encode('latin1').decode('cp1252').encode('UTF-8'))
There’s thirty days in June

Depending on whether the API takes text (unicode) or bytes, step 3 (encoding to UTF-8) may not be necessary.
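
If this repair is needed in more than one place, the round trip can be wrapped in a small function. This is a minimal sketch, not part of any library: `repair_cp1252` is a hypothetical name, and falling back to the unchanged input on failure is an assumption about what is wanted. It runs on both Python 2.7 and Python 3.

```python
def repair_cp1252(text):
    """Hypothetical helper: undo a latin1 mis-decode of cp1252 bytes.

    Returns the text unchanged when the round trip is not possible.
    """
    try:
        return text.encode('latin1').decode('cp1252')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters beyond U+00FF (so it was not
        # produced by a latin1 decode), or it holds a byte value that
        # Python's cp1252 codec leaves undefined (such as 0x81).
        return text


title = u'There\x92s thirty days in June'
fixed = repair_cp1252(title)        # text with U+2019 in place of U+0092
utf8_bytes = fixed.encode('utf-8')  # only needed if the consumer wants bytes
```

Text that is already correct passes through safely: characters above U+00FF make the latin1 encode fail, so the input is returned as-is, and latin1 and cp1252 agree on every byte outside the 0x80-0x9F range.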

This is good; it can handle so many encodings, including e.g. Chinese GB18030. — Yan King Yin, Nov 09 '22 at 13:04
I like your technique. But I need to compare 2 strings that are localized in Cyrillic for Russian. One string is saved in a SQLite database as a filename. The other is the actual string used as a filename on disk. When I go to verify that the file was correctly generated, the strings are **visually identical** but != when I compare them. Obviously due to encoding. Do I need to know the encoding scheme of the strings before I do the encode/decode sequence to produce 2 identically encoded strings that will match? — horace, May 06 '23 at 13:46
