4

I am using Python, and my code receives a string in which Unicode characters are written as \uXXXX escapes. Here is the string I get from other code:

\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5

I need to convert it back to the original string. How can I do that?

tripleee
Bin Chen
  • 1
    Can you please post the `repr` of the byte string? – Fred Foo Jul 07 '12 at 14:12
  • What do you mean? It's a plain string type with "\u6b22\u8fce\u63d0\u4ea4\u5fae\..." inside. – Bin Chen Jul 07 '12 at 14:22
  • Can you please explain _why_ you want to convert to a string? Because that cannot be done, but you can work around it by treating the unicode string as a unicode string. – C0deH4cker Jul 07 '12 at 14:25
  • How do I do that? Imagine someone passes me a variable a = '\u6b22\u8fce\u63d0\u4ea4\u5fae' and asks me to convert it to the original UTF string (Far East characters). – Bin Chen Jul 07 '12 at 14:26
  • 1
    take a look at this http://stackoverflow.com/questions/2365411/python-convert-unicode-to-ascii-without-errors – Surya Jul 07 '12 at 14:26
  • @Surya doesn't work, please try to understand my question thoroughly. – Bin Chen Jul 07 '12 at 14:36
  • Sorry, those are actually the same. Never mind. – Fred Foo Jul 07 '12 at 16:11
  • Where did that string come from? There are many, many different syntaxes that use `\u` escapes, and you need to choose the right one to avoid inconsistent results with any other escapes that are in there. JSON is one common possibility, but if that's what you've got you will need to use a JSON decoder rather than `unicode-escape` which is specific to Python Unicode string literals. – bobince Jul 08 '12 at 08:25
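
To illustrate bobince's point: if the escapes come from JSON, a JSON decoder is the matching tool. The following is a minimal Python 3 sketch, not something posted in the thread. It assumes the fragment is the body of a JSON string (so it contains no unescaped double quotes or control characters); the names `raw` and `text` are made up for the example.

```python
import json

# Hypothetical input: the raw text between the quotes of a JSON string.
raw = r'\u6b22\u8fce\u63d0\u4ea4'

# Wrap the fragment in double quotes so it is a complete JSON string value,
# then let the JSON decoder interpret the \uXXXX escapes.
text = json.loads('"' + raw + '"')
print(text)

# Unlike a naive per-escape conversion, the JSON decoder also joins
# surrogate pairs into a single character.
emoji = json.loads(r'"\ud83d\ude00"')
```

If the input is a whole JSON document rather than a bare fragment, passing the document itself to `json.loads` avoids the manual quoting.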

3 Answers

17

I think this is what you want. The input isn't really a UTF-8 byte string (well, technically it is, but only because ASCII is a subset of UTF-8); it is an ASCII string containing Python Unicode escape sequences. In Python 2:

>>> s='\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5'
>>> print s.decode('unicode-escape')
欢迎提交微博搜索使用反馈，请直接

FYI, this is UTF-8:

>>> s.decode('unicode-escape').encode('utf8')
'\xe6\xac\xa2\xe8\xbf\x8e\xe6\x8f\x90\xe4\xba\xa4\xe5\xbe\xae\xe5\x8d\x9a\xe6\x90\x9c\xe7\xb4\xa2\xe4\xbd\xbf\xe7\x94\xa8\xe5\x8f\x8d\xe9\xa6\x88\xef\xbc\x8c\xe8\xaf\xb7\xe7\x9b\xb4\xe6\x8e\xa5'
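
The session above is Python 2, where `str` is a byte string with a `.decode()` method. In Python 3, `str` has no `.decode()`, so a rough equivalent goes through `bytes` first. This is a sketch under the assumption that the input contains only ASCII characters (true for a string made purely of `\uXXXX` escapes); the variable names are illustrative.

```python
# Python 3: the input is a str holding literal backslash-u sequences.
# The raw-string prefix keeps the backslashes as real characters.
s = r'\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5'

# Encode to bytes (fails loudly on non-ASCII input), then interpret
# the escapes with the unicode_escape codec.
decoded = s.encode('ascii').decode('unicode_escape')
print(decoded)

# The UTF-8 encoding of the decoded text, as a bytes object.
utf8_bytes = decoded.encode('utf-8')
```

If the input may contain non-ASCII characters alongside the escapes, the `encode('ascii')` step raises `UnicodeEncodeError`, which is usually preferable to silently mangling the text.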

Mark Tolonen
2

If I understand the question, we have a plain byte string that contains literal Unicode escape sequences, something like:

a = '\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5'

In [122]: a
Out[122]: '\\u6b22\\u8fce\\u63d0\\u4ea4\\u5fae\\u535a\\u641c\\u7d22\\u4f7f\\u7528\\u53cd\\u9988\\uff0c\\u8bf7\\u76f4\\u63a5'

So we need to parse the code points out of the string manually and convert each one to its character:

\u6b22 => unichr(0x6b22) # 欢

or, for the whole string (assuming it consists only of six-character \uXXXX escapes):

print "".join([unichr(int('0x'+a[i+2:i+6], 16)) for i in range(0, len(a), 6)])
欢迎提交微博搜索使用反馈，请直接
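
The slicing above relies on every six characters being one escape. A variation that tolerates ordinary text mixed in with the escapes can be sketched with a regular expression. This is a Python 3 sketch rather than the answer's own code; `decode_u_escapes` is a hypothetical helper name, and it does not combine surrogate pairs into a single character.

```python
import re

# Matches a backslash, the letter u, and exactly four hex digits.
_U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def decode_u_escapes(text):
    # Replace each escape with the character for its code point;
    # anything that is not an escape is left untouched.
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

a = r'\u6b22\u8fce, hello'
result = decode_u_escapes(a)
print(result)
```

Because only `\uXXXX` is recognised, other backslash sequences such as `\n` pass through unchanged, which may or may not be what the source format intends.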
Joey
Tisho
-1

Mark Pilgrim explains this in his book Dive Into Python. Take a look:

http://www.diveintopython.net/xml_processing/unicode.html

>>> s = u"\u6b22\u8fce\u63d0\u4ea4\u5fae\u535a\u641c\u7d22\u4f7f\u7528\u53cd\u9988\uff0c\u8bf7\u76f4\u63a5"
>>> print s.encode("utf-8")
欢迎提交微博搜索使用反馈，请直接
Surya
  • 2
    The string s that is passed to my code doesn't have u'' in front of it; it's a variable. Try replacing the literal with a variable b and you will find that your solution doesn't work syntactically. – Bin Chen Jul 07 '12 at 14:36