Python Convert Unicode-Hex utf-8 strings to Unicode strings

Question

I have s = u'Gaga\xe2\x80\x99s' but need to convert it to t = u'Gaga\u2019s'

How can this be best achieved?

unutbu · Answer 1 · 2011-09-30T11:52:49.633

7

s = u'Gaga\xe2\x80\x99s'
t = u'Gaga\u2019s'
x = s.encode('raw-unicode-escape').decode('utf-8')
assert x==t

print(x)

yields

Gaga’s

edited Sep 30 '11 at 11:52

answered Sep 30 '11 at 11:46

unutbu


I get "GagaÔÇÖs" in my windows terminal – rocksportrocker Sep 30 '11 at 11:49
`print repr(t)` still yields `'Gaga\xe2\x80\x99s'` – Acorn Sep 30 '11 at 11:49
@rocksportrocker, @Acorn looks like he fixed that. – agf Sep 30 '11 at 11:58
thank-you! @rocksportrocker, works too but can only accept one answer. – Henry Thornton Sep 30 '11 at 16:46
`raw-unicode-escape` is there to encode/decode `\u` escapes. That it happens to do a Latin-1 encode for characters below `\u0100` at the same time is a side-effect I'm not sure I'd want to rely on; I think Mark's version is the more commonly-used idiom for recovering mis-decoded-UTF-8. – bobince Sep 30 '11 at 20:41
@bobince: After some experimentation, I have to agree. Thanks for the warning. I really appreciate it. – unutbu Sep 30 '11 at 22:48
@dbv: After studying this some more, I think Mark Tolonen has the better answer. In the interest of having SO report the best answer at the top, please consider accepting [his answer](http://stackoverflow.com/questions/7609776/python-convert-unicode-hex-utf-8-strings-to-unicode-strings/7610946#7610946) instead. – unutbu Sep 30 '11 at 22:53
@unutbu: I applied Mark's and your method to our data and both worked. But, have taken advice on-board and changed best answer. Thank-you to everyone as these are typically tricky areas. – Henry Thornton Oct 03 '11 at 21:38
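To illustrate the caveat bobince raises in the comments above, here is a small sketch in Python 3 syntax (the thread itself uses Python 2). The sample string `mixed` is made up for illustration: it is the question's string with a genuine euro sign appended. As far as I can tell, `raw-unicode-escape` silently turns that character into a literal `\u20ac` escape sequence, while `latin-1` raises an error instead.

```python
# Hypothetical input: mis-decoded UTF-8 plus a real character above U+00FF.
mixed = "Gaga\xe2\x80\x99s \u20ac"

# raw-unicode-escape does not fail here: characters below U+0100 become
# single bytes, and the euro sign is written out as the six ASCII
# characters \u20ac.
raw = mixed.encode("raw-unicode-escape")
via_escape = raw.decode("utf-8")
print(ascii(via_escape))

# latin-1 cannot encode anything above U+00FF, so it raises instead of
# quietly producing altered text.
try:
    mixed.encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True
print(latin1_failed)  # True
```

For strings that contain only characters below U+0100, such as the one in the question, both encodings should give the same bytes, which is why both answers worked on the asker's data.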

score 7 · Accepted Answer · answered Sep 30 '11 at 13:18

7

Wherever you decoded the original string, it was likely decoded with latin-1 or a close relative. Since latin-1 maps each byte value directly to the first 256 Unicode code points, encoding with it recovers the original bytes, which can then be decoded as UTF-8:

>>> s = u'Gaga\xe2\x80\x99s'
>>> s.encode('latin-1').decode('utf8')
u'Gaga\u2019s'
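The same round trip can be wrapped in a small helper. This is only a sketch in Python 3 syntax, and the name `fix_mojibake` and its `wrong_encoding` parameter are made up for illustration; it assumes the text was UTF-8 that got decoded as latin-1, and it returns the input unchanged when that assumption does not hold.

```python
def fix_mojibake(text, wrong_encoding="latin-1"):
    """Undo UTF-8 text that was mis-decoded as `wrong_encoding` (hypothetical helper)."""
    try:
        # Recover the original bytes, then decode them the way they
        # should have been decoded in the first place.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text has characters outside the wrong encoding, or the
        # recovered bytes are not valid UTF-8: leave it untouched.
        return text


s = "Gaga\xe2\x80\x99s"
t = fix_mojibake(s)
print(ascii(t))  # 'Gaga\u2019s'
```

If the original mis-decoding used a "close relative" such as cp1252 rather than latin-1, passing that name as `wrong_encoding` may work, but that depends on the data.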

answered Sep 30 '11 at 13:18

Mark Tolonen


Hi, what if I want to do the vice-versa, converting it from unicode representation to hexadecimal representation, as I'm sending the data to some system that expects the unicode data in hex format. – securecurve Feb 12 '13 at 05:02
@securecurve, likely some form of encode. Ask a question with your specific requirements and sample input and output. – Mark Tolonen Feb 12 '13 at 15:19

rocksportrocker · Answer 3 · 2011-09-30T11:51:32.107

2

import codecs

s = u"Gaga\xe2\x80\x99s"
s_as_str = codecs.charmap_encode(s)[0]
t = unicode(s_as_str, "utf-8")
print repr(t)

prints

u'Gaga\u2019s'

edited Sep 30 '11 at 11:51

answered Sep 30 '11 at 11:42

rocksportrocker


Curious about this.. I don't see a `codecs.charmap_encode` in the 2.7 or 3.3 Python docs, link? – agf Oct 01 '11 at 04:53
