
I used Python's urllib2 to fetch some data, but the text comes back as Unicode escape sequences instead of readable characters.

I've tried something like below:

a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print unicode(a).encode("gb2312")

a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a.encode("utf-8").decode("utf-8")

a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print u""+a

a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print str(a).decode("utf-8")

a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print str(a).encode("utf-8")

a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a.decode("utf-8").encode("gb2312")

But all of them print the same thing:

\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728

And I want to get the following Chinese text:

方法，删除存储在
Remi Guan
Lex
  • Which python version are you using? Maybe you need `from __future__ import unicode_literals` – gil Feb 23 '16 at 12:50
  • My answer: Just use Python 3 and the `a` will be your expected string and you don't need convert it yourself. – Remi Guan Feb 23 '16 at 12:51
  • Maybe the console only supports ascii characters? – Amit Gold Feb 23 '16 at 12:54
  • And also [this one](http://stackoverflow.com/questions/2688020/how-to-print-chinese-word-in-my-code-using-python). Oh hey, [there's also another way](http://stackoverflow.com/questions/19371953/python-2-7-converting-unicode-to-chinese-character). – Remi Guan Feb 23 '16 at 13:22

2 Answers


You need to convert the string to a unicode string.

First of all, a plain (byte) string literal in Python 2 does not interpret \u escapes, so the backslashes in a are kept as literal characters:

a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"

print a # Prints \u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728

a       # The interactive shell echoes the repr: '\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728'

The string therefore contains only ASCII characters (backslashes, letters and digits), so encoding or decoding it as utf-8 or gb2312 makes no difference.
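A quick way to check this claim is sketched below. It is written for Python 3, where a raw-string prefix is needed to reproduce the literal backslashes of the Python 2 byte string; the variable names are illustrative only.

```python
# Python 3 sketch: r"..." keeps the backslashes literal, mimicking the
# Python 2 byte string from the question.
raw = r"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"

# 8 escape sequences of 6 ASCII characters each, not 8 Chinese characters.
print(len(raw))

# Every character is plain ASCII, so ASCII-compatible codecs such as
# utf-8 and gb2312 produce identical bytes.
as_utf8 = raw.encode("utf-8")
as_gb2312 = raw.encode("gb2312")
print(as_utf8 == as_gb2312)
```

Since no codec changes the bytes, the escape sequences have to be interpreted, not re-encoded.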

You can either use a unicode literal or convert the string into a unicode string.

To use a unicode literal, just add a u in front of the string:

a = u"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"

To convert an existing string into a unicode string, call unicode with unicode_escape as the encoding parameter:

print unicode(a, encoding='unicode_escape') # Prints 方法，删除存储在

I bet you are getting the string from a JSON response, so the second method is likely to be what you need.
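If the data really is a JSON response, it may be simpler to parse the whole body with the json module and let it decode the escapes. The sketch below assumes Python 3; the body string and the "title" key are hypothetical stand-ins for whatever the server actually returns.

```python
import json

# Hypothetical response body; real data would come from the HTTP response.
# The doubled backslashes put literal \uXXXX sequences into the string.
body = '{"title": "\\u65b9\\u6cd5\\uff0c\\u5220\\u9664"}'

# The JSON parser interprets the \uXXXX escapes on its own.
data = json.loads(body)
title = data["title"]
print(title)
```

With this approach no manual unicode_escape step is needed, because the escapes never leave the JSON layer.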

BTW, unicode_escape is a Python-specific encoding, which is used to:

"Produce a string that is suitable as Unicode literal in Python source code"
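For Python 3, where the unicode built-in no longer exists, one possible equivalent is to encode the text to ASCII bytes and then decode those bytes with unicode_escape. This is only a sketch and assumes the input holds nothing but ASCII characters; the raw-string literal stands in for data received with literal backslashes.

```python
# Python 3 sketch: the raw string stands in for text that arrived with
# literal backslashes (for example from an HTTP response).
raw = r"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"

# Encoding to ASCII first fails loudly on non-ASCII input, which
# unicode_escape would otherwise misread as Latin-1.
text = raw.encode("ascii").decode("unicode_escape")
print(text)
```

Note that a normal (non-raw) Python 3 literal such as "\u65b9" is already the decoded character, so this step is only needed for text that contains the backslashes as data.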

Rikka
  • Yes, `unicode_escape` seems the way to go. – mhawke Feb 23 '16 at 13:27
  • thanks you very very much !!!! – Lex Feb 24 '16 at 01:56
  • `a = '\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728'; print(unicode(a, encoding='unicode_escape'))` fails in Python 3 with "decoding str is not supported". Which module should I import, or is there another way to do this in Python 3? Thanks very much! – Lex Aug 12 '16 at 00:48
  • How can this be implemented in Python 3? – linrongbin Jan 18 '21 at 06:28

Where are you getting this data from? Perhaps you could share the method by which you are downloading and extracting it.

Anyway, it kind of looks like a remnant of some JSON encoded string? Based on that assumption, here is a very hacky (and not entirely serious) way to do it:

>>> a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
>>> a
'\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728'
>>> s = '"{}"'.format(a)
>>> s
'"\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728"'
>>> import json
>>> json.loads(s)
u'\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728'
>>> print json.loads(s)
方法，删除存储在

This involves recreating a valid JSON encoded string by wrapping the string stored in a in double quotes, then decoding that JSON string into a Python unicode string.
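The same trick can be sketched for Python 3 as follows. It assumes the input contains no unescaped double quotes or control characters, since either would make the wrapped text invalid JSON; the variable names are illustrative.

```python
import json

# Raw string standing in for data that contains literal backslashes.
raw = r"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"

# Wrap the text in double quotes so it forms a JSON string literal.
quoted = '"{}"'.format(raw)

# json.loads decodes the \uXXXX escapes and returns a str.
text = json.loads(quoted)
print(text)
```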

mhawke