Python: how to convert string with \unnnn escapes to Unicode string?

Question

I am using Python, and my code needs to convert a string that represents Unicode characters as \u1234-style escapes back into the original text.

Here is the string that I receive from other code:

\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5

I need to convert it back to the original string. How can I do that?

What do you mean? It's a plain string with "\u6b22\u8fce\u63d0\u4ea4\u5fae\..." inside. – Bin Chen, Jul 07 '12 at 14:22
Can you please explain _why_ you want to convert to a string? Because that cannot be done, but you can work around it by treating the unicode string as a unicode string. – C0deH4cker, Jul 07 '12 at 14:25
How do I do that? Imagine someone passes me a variable a = '\u6b22\u8fce\u63d0\u4ea4\u5fae' and asks me to convert it to the original UTF string (Far East characters). – Bin Chen, Jul 07 '12 at 14:26
Take a look at this: http://stackoverflow.com/questions/2365411/python-convert-unicode-to-ascii-without-errors – Surya, Jul 07 '12 at 14:26
@Surya That doesn't work; please try to understand my question thoroughly. – Bin Chen, Jul 07 '12 at 14:36
Where did that string come from? There are many, many different syntaxes that use `\u` escapes, and you need to choose the right one to avoid inconsistent results with any other escapes that are in there. JSON is one common possibility, but if that's what you've got you will need to use a JSON decoder rather than `unicode-escape`, which is specific to Python Unicode string literals. – bobince, Jul 08 '12 at 08:25
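
To illustrate bobince's point with a hedged sketch: if the text really is the body of a JSON string value, a JSON decoder can resolve the escapes. The snippet is written for Python 3, the payload is a made-up sample, and wrapping it in double quotes is an assumption that only holds when the text contains no unescaped quotes or raw control characters.

```python
import json

# Hypothetical payload: the backslashes are real characters, as they would be
# in data read from a file or a socket (hence the raw string literal).
payload = r'\u6b22\u8fce\u63d0\u4ea4'

# Wrap the text in double quotes to form a JSON string literal, then parse it.
text = json.loads('"' + payload + '"')
print(text)
```

If the data arrives as a complete JSON document, parsing the whole document with json.loads is the safer route, since no manual quoting is needed.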

score 17 · Accepted Answer · answered Jul 07 '12 at 16:43

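I think this is what you want. The input isn't really a UTF-8 byte string (well, technically it is, but only because it is pure ASCII, and ASCII is a subset of UTF-8). It is a byte string of Python Unicode escapes, which the unicode-escape codec decodes (Python 2):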

>>> s='\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5'
>>> print s.decode('unicode-escape')
欢迎提交微博搜索使用反馈，请直接

FYI, this is the same text encoded as UTF-8:

>>> s.decode('unicode-escape').encode('utf8')
'\xe6\xac\xa2\xe8\xbf\x8e\xe6\x8f\x90\xe4\xba\xa4\xe5\xbe\xae\xe5\x8d\x9a\xe6\x90\x9c\xe7\xb4\xa2\xe4\xbd\xbf\xe7\x94\xa8\xe5\x8f\x8d\xe9\xa6\x88\xef\xbc\x8c\xe8\xaf\xb7\xe7\x9b\xb4\xe6\x8e\xa5'
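
The session above is Python 2. A rough Python 3 equivalent, offered as a sketch: str has no decode method there, so the text is first turned into bytes. Encoding as ASCII is an assumption that the input holds only ASCII characters; it raises UnicodeEncodeError otherwise.

```python
# Sample input, shortened; the raw literal keeps the backslashes as real
# characters, mimicking a string received at run time.
s = r'\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a'

# str -> bytes (ASCII only), then let the unicode_escape codec resolve the escapes.
text = s.encode('ascii').decode('unicode_escape')

# The decoded text can then be encoded as UTF-8 bytes if needed.
utf8_bytes = text.encode('utf-8')

print(text)
print(utf8_bytes)
```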

Mark Tolonen

Isn't there output missing from the second line? – DSM Jul 07 '12 at 16:46
Yes, it was the first line with a `u` in front. I deleted one but not the other in my edit. – Mark Tolonen Jul 07 '12 at 16:48

score 2 · Answer 2 · edited Jul 07 '12 at 15:16

If I understand the question, we have a plain byte string containing Unicode escapes, something like this:

a = '\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5'

In [122]: a
Out[122]: '\\u6b22\\u8fce\\u63d0\\u4ea4\\u5fae\\u535a\\u641c\\u7d22\\u4f7f\\u7528\\u53cd\\u9988\\uff0c\\u8bf7\\u76f4\\u63a5'

So we need to parse the hexadecimal code points out of the string manually and convert each one to a character:

\u6b22 => unichr(0x6b22) # 欢

Putting it together:

# Assumes the string consists solely of six-character \uXXXX escapes.
print "".join(unichr(int(a[i+2:i+6], 16)) for i in range(0, len(a), 6))
欢迎提交微博搜索使用反馈，请直接
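
The slicing approach above assumes every six-character chunk is an escape. A slightly more tolerant variant, sketched here for Python 3 (where chr replaces unichr), uses a regular expression so that ordinary text between escapes passes through unchanged. The name decode_u_escapes is made up for illustration, and surrogate pairs for characters outside the Basic Multilingual Plane are not combined.

```python
import re

# Matches a backslash, the letter u, and exactly four hex digits.
_U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def decode_u_escapes(text):
    # Hypothetical helper: replace each \uXXXX escape with its character.
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Sample input mixing escapes with plain ASCII text.
a = r'\u6b22\u8fce, hello \u63d0\u4ea4'
print(decode_u_escapes(a))
```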

score -1 · Answer 3 · answered Jul 07 '12 at 14:33

Mark Pilgrim explains this in his book Dive Into Python. Take a look:

http://www.diveintopython.net/xml_processing/unicode.html

>>> s = u"\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5"
>>> print s.encode("utf-8")
欢迎提交微博搜索使用反馈，请直接

Surya

The string s that is passed to my code doesn't have u'' in front of it; it's a variable. Try replacing the literal with a variable b and you will find that your solution can't work syntactically. – Bin Chen Jul 07 '12 at 14:36
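
Bin Chen's objection can be seen in a small sketch (Python 3 syntax; the variable names are illustrative). An escape inside a string literal is interpreted when the source code is compiled, whereas a string that arrives at run time contains actual backslash characters, so adding a prefix in the source cannot help.

```python
# Escape written in source code: the compiler turns it into one character.
literal = '\u6b22'

# Same six characters arriving as data (the raw literal simulates that).
received = r'\u6b22'

print(len(literal), len(received))

# The run-time string has to be decoded explicitly.
decoded = received.encode('ascii').decode('unicode_escape')
print(decoded == literal)
```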
