How to fix a string which says it's unicode but is in fact a bytestring

Question

Occasionally I have a string which claims to be unicode, but in fact is not. It looks like this:

s = u'\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae'

It's in fact just a UTF-8 bytestring with a 'u' in front of it, and I don't know how to fix this. When I try to convert it to a real unicode with unicode(s, 'utf8'), the code fails because s is already a unicode object. Decoding with s.decode('utf8') fails too, with a UnicodeEncodeError.

@TimCastelijns, No, at least not the unicode i need. '\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae' is in fact u'诸葛亮', but u'\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae' is unprintable. — apporc, Apr 09 '14 at 10:33
The answer can be found in question: http://stackoverflow.com/questions/11174790/convert-unicode-string-to-byte-string — apporc, Apr 10 '14 at 08:56
If the questions are not duplicates of each other, you can form an answer and post it here yourself, this is useful for future references. — Tim, Apr 10 '14 at 08:58

score 0 · Answer 1 · edited May 23 '17 at 12:27

These are the two approaches I have now:

(1) First get the code point of each character with ord(), then turn each value back into a single byte with chr().

>>> e
u'\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae'
>>> map(ord,e)
[232, 175, 184, 232, 145, 155, 228, 186, 174]
>>> map(chr,map(ord,e))
['\xe8', '\xaf', '\xb8', '\xe8', '\x91', '\x9b', '\xe4', '\xba', '\xae']
>>> ''.join(map(chr,map(ord,e)))
'\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae'
>>> print ''.join(map(chr,map(ord,e)))
诸葛亮
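
The session above stops at a byte string. A minimal sketch of the same ord() idea that also performs the final UTF-8 decode is shown here. It assumes every character in the input has a code point below 256 (otherwise bytearray raises ValueError), and the variable names raw and fixed are only illustrative. It is written so that it should run on both Python 2 and Python 3.

```python
# Input: a unicode string whose code points are really UTF-8 byte values.
e = u'\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae'

# Map every character to its code point and pack the values into bytes.
# bytearray() rejects values above 255, so this only works for
# strings that are made up entirely of "byte-like" characters.
raw = bytes(bytearray(ord(ch) for ch in e))

# Decode the recovered bytes as UTF-8 to get the intended text.
fixed = raw.decode('utf-8')

print(fixed == u'\u8bf8\u845b\u4eae')
```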

(2) As Ignacio Vazquez-Abrams says, ISO 8859-1 (aka Latin-1) maps the first 256 Unicode code points to their byte values, so encoding with it recovers the original bytes.

>>> e.encode('latin1')
'\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae'
>>> print e.encode('latin1')
诸葛亮
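
To end up with a real unicode object rather than a byte string, the Latin-1 encode can be followed by a UTF-8 decode. The sketch below wraps that round trip in a hypothetical helper, fix_mislabeled (the name and the fall-back behaviour are assumptions, not something from the linked question). It returns the input unchanged when the string cannot be a mislabeled UTF-8 bytestring.

```python
def fix_mislabeled(text):
    """Hypothetical helper: reinterpret a unicode string whose code
    points are really UTF-8 byte values. Returns the input unchanged
    if the round trip is not possible."""
    try:
        # Latin-1 maps code points 0-255 straight to byte values,
        # so this recovers the original bytes, which are then
        # decoded as UTF-8.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character is above U+00FF, or the bytes are not
        # valid UTF-8; assume the string was fine as it was.
        return text


s = u'\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae'
fixed = fix_mislabeled(s)
print(fixed == u'\u8bf8\u845b\u4eae')
```

The fall-back is a design choice: a genuine Latin-1 string that happens to form valid UTF-8 bytes would still be converted, so the helper is best applied only to data already known to be mislabeled.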
