5

I have s = u'Gaga\xe2\x80\x99s' but need to convert it to t = u'Gaga\u2019s'.

How can this be best achieved?

eumiro
Henry Thornton

3 Answers

7
s = u'Gaga\xe2\x80\x99s'
t = u'Gaga\u2019s'
x = s.encode('raw-unicode-escape').decode('utf-8')
assert x==t

print(x)

yields

Gaga’s
unutbu
  • I get "GagaÔÇÖs" in my windows terminal – rocksportrocker Sep 30 '11 at 11:49
  • `print repr(t)` still yields `'Gaga\xe2\x80\x99s'` – Acorn Sep 30 '11 at 11:49
  • @rocksportrocker, @Acorn `looks like he fixed that`. – agf Sep 30 '11 at 11:58
  • thank-you! @rocksportrocker, works too but can only accept one answer. – Henry Thornton Sep 30 '11 at 16:46
  • `raw-unicode-escape` is there to encode/decode `\u` escapes. That it happens to do a Latin-1 encode for characters below `\u0100` at the same time is a side-effect I'm not sure I'd want to rely on; I think Mark's version is the more commonly-used idiom for recovering mis-decoded-UTF-8. – bobince Sep 30 '11 at 20:41
  • @bobince: After some experimentation, I have to agree. Thanks for the warning. I really appreciate it. – unutbu Sep 30 '11 at 22:48
  • 3
    @dbv: After studying this some more, I think Mark Tolonen has the better answer. In the interest of having SO report the best answer at the top, please consider accepting [his answer](http://stackoverflow.com/questions/7609776/python-convert-unicode-hex-utf-8-strings-to-unicode-strings/7610946#7610946) instead. – unutbu Sep 30 '11 at 22:53
  • @unutbu: I applied Mark's and your method to our data and both worked. But, have taken advice on-board and changed best answer. Thank-you to everyone as these are typically tricky areas. – Henry Thornton Oct 03 '11 at 21:38
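bobince's caution about `raw-unicode-escape` can be illustrated with a short sketch. It uses Python 3 syntax, and the string `mixed` is a made-up input for illustration. For text made only of characters below U+0100 the two codecs produce the same bytes, but once a character at or above U+0100 is present, `raw-unicode-escape` quietly writes a literal `\uXXXX` escape where `latin-1` raises an error:

```python
s = 'Gaga\xe2\x80\x99s'  # mis-decoded UTF-8; every character is below U+0100

# For such input the two codecs yield identical bytes.
same_bytes = s.encode('raw-unicode-escape') == s.encode('latin-1')

# Hypothetical input that also holds a genuine non-Latin-1 character (U+2019).
mixed = 'Gaga\xe2\x80\x99s \u2019'

# raw-unicode-escape does not fail; it emits the six ASCII bytes \u2019 instead.
escaped = mixed.encode('raw-unicode-escape')
round_tripped = escaped.decode('utf-8')

# latin-1 refuses the same input, which makes the problem visible.
try:
    mixed.encode('latin-1')
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

print(same_bytes)
print(repr(round_tripped))
print(latin1_failed)
```

With `raw-unicode-escape` the second string comes back with a stray backslash escape in it, which is arguably worse than an exception because nothing signals that the data was not what the round trip expected.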
7

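Wherever you decoded the original string, it was likely decoded with latin-1 or a close relative. Since latin-1 maps each byte value to the Unicode code point with the same number (the first 256 code points), encoding back to latin-1 recovers the original bytes, which can then be decoded as UTF-8: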

>>> s = u'Gaga\xe2\x80\x99s'
>>> s.encode('latin-1').decode('utf8')
u'Gaga\u2019s'
Mark Tolonen
  • Hi, what if I want to do the vice-versa, converting it from unicode representation to hexadecimal representation, as I'm sending the data to some system that expects the unicode data in hex format. – securecurve Feb 12 '13 at 05:02
  • @securecurve, likely some form of encode. Ask a question with your specific requirements and sample input and output. – Mark Tolonen Feb 12 '13 at 15:19
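The Latin-1 round trip from Mark Tolonen's answer can be wrapped in a small function when some inputs may already be correct. This is only a sketch in Python 3 syntax: the name `fix_mojibake` is hypothetical, and it assumes the mis-decode really used Latin-1 (if it was a relative such as cp1252, that codec would have to be used for the encode step instead).

```python
def fix_mojibake(text):
    """Hypothetical helper: undo a Latin-1 mis-decode of UTF-8 bytes."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the recovered
        # bytes are not valid UTF-8; in both cases leave the input untouched.
        return text


broken = 'Gaga\xe2\x80\x99s'
already_ok = 'Gaga\u2019s'

print(fix_mojibake(broken) == 'Gaga\u2019s')
print(fix_mojibake(already_ok) == already_ok)
```

Returning the input unchanged on failure is a design choice for this sketch; raising instead may be preferable when silently passing through bad data would hide a problem upstream.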
2
import codecs

s = u"Gaga\xe2\x80\x99s"
# charmap_encode returns a (byte string, length) tuple; [0] is the byte string
s_as_str = codecs.charmap_encode(s)[0]
t = unicode(s_as_str, "utf-8")
print repr(t)

prints

u'Gaga\u2019s'
rocksportrocker
  • Curious about this.. I don't see a `codecs.charmap_encode` in the 2.7 or 3.3 Python docs, link? – agf Oct 01 '11 at 04:53
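On agf's question: in CPython, `codecs.charmap_encode` is re-exported from the internal `_codecs` module and does not appear to be described in the library reference. The sketch below (Python 3 syntax) suggests that, when called without a mapping, it yields the same bytes as a Latin-1 encode for this input. Treat that as observed CPython behaviour, not a documented guarantee.

```python
import codecs

s = 'Gaga\xe2\x80\x99s'

# charmap_encode returns a (bytes, length_consumed) tuple.
raw, consumed = codecs.charmap_encode(s)

# Compare with the documented Latin-1 codec and finish the round trip.
matches_latin1 = raw == s.encode('latin-1')
fixed = raw.decode('utf-8')

print(matches_latin1)
print(fixed == 'Gaga\u2019s')
```

Given that equivalence, the documented `s.encode('latin-1')` spelling is probably the safer choice in new code.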