Occasionally I get a string that claims to be unicode but in fact is not. It looks like this:

s = u'\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae'

It's in fact just a byte string with a 'u' in front of it, and I don't know how to fix it. When I try to convert it to a real unicode with unicode(s, 'utf8'), the call fails because s is already a unicode object. Decoding with s.decode('utf8') fails too, with UnicodeEncodeError.

apporc
  • If Python says your string is unicode, it is unicode – Tim Apr 09 '14 at 10:31
  • @TimCastelijns, no, at least not the unicode I need. '\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae' is in fact the UTF-8 encoding of u'诸葛亮', but u'\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae' does not print as that text. – apporc Apr 09 '14 at 10:33
  • The answer can be found in this question: http://stackoverflow.com/questions/11174790/convert-unicode-string-to-byte-string – apporc Apr 10 '14 at 08:56
  • If the questions are not duplicates of each other, you can write up an answer and post it here yourself; this is useful for future reference. – Tim Apr 10 '14 at 08:58

1 Answer

These are the two approaches I have now:

(1) First get the code point of each character with ord(), then turn each one back into a byte with chr().

>>> e
u'\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae'
>>> map(ord,e)
[232, 175, 184, 232, 145, 155, 228, 186, 174]
>>> map(chr,map(ord,e))
['\xe8', '\xaf', '\xb8', '\xe8', '\x91', '\x9b', '\xe4', '\xba', '\xae']
>>> ''.join(map(chr,map(ord,e)))
'\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae'
>>> print ''.join(map(chr,map(ord,e)))
诸葛亮

(2) As Ignacio Vazquez-Abrams says, ISO 8859-1 (aka Latin-1) maps the first 256 Unicode code points to their byte values, so encoding with it recovers the original bytes.

>>> e.encode('latin1')
'\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae'
>>> print e.encode('latin1')
诸葛亮
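
Both approaches give back the UTF-8 byte string, not yet a real unicode; one more decode step gets there. Below is a minimal sketch of the full round trip, assuming the text really is UTF-8 bytes that were mislabeled as unicode. The function name fix_mislabeled_utf8 is my own, not from any library, and the code is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-

def fix_mislabeled_utf8(text):
    """Hypothetical helper: recover text whose UTF-8 bytes were stored
    one byte per code point in a unicode string."""
    # Latin-1 maps code points 0-255 straight to byte values 0-255,
    # so this step recovers the original bytes unchanged.
    raw = text.encode('latin-1')
    # Those bytes are assumed to be UTF-8; decode them properly.
    return raw.decode('utf-8')


s = u'\xe8\xaf\xb8\xe8\x91\x9b\xe4\xba\xae'
fixed = fix_mislabeled_utf8(s)

# Approach (1), the ord() route, yields the same bytes as Latin-1 encoding.
raw_via_ord = bytes(bytearray(map(ord, s)))
same_bytes = (raw_via_ord == s.encode('latin-1'))

# A string that is already correct contains code points above 255,
# so the Latin-1 step refuses it instead of silently corrupting it.
try:
    fix_mislabeled_utf8(u'\u8bf8\u845b\u4eae')
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

Here fixed is the three-character string u'\u8bf8\u845b\u4eae', i.e. 诸葛亮. Because already-correct text containing characters above U+00FF raises UnicodeEncodeError, it is reasonably safe to wrap the call in try/except and fall back to the original string.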
apporc