1
# coding=utf-8
import codecs

str_unicode = "\\u201c借\\u201d东风"
str_bytes = codecs.decode(str_unicode, 'unicode-escape')
print(str_bytes)

It prints “å”ä¸é£ at the console.

eyllanesc
GoTop

3 Answers

3

Francisco Couzo correctly describes your issue. If you have control of the string, you should avoid escaping the backslashes of the quotation mark escapes in your Unicode string. But I'm guessing that you didn't actually write that string yourself as a literal, but rather got it from an external source (like a file).

If your Unicode string already has the extra escape characters in it, you can fix the problem by first encoding your data (using str.encode), then stripping the extra backslashes from the already encoded characters, then finally decoding again:

str_unicode = "\\u201c借\\u201d东风"  # or somefile.read(), or whatever

fixed = str_unicode.encode('unicode-escape').replace(b'\\\\', b'\\').decode('unicode-escape')

print(fixed)  # prints “借”东风
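
For reuse, the same three steps can be wrapped in a small helper. This is only a sketch: the name `unescape_doubled` is hypothetical, and it assumes every backslash in the input belongs to an over-escaped sequence. A literal backslash that is meant to stay in the text would be interpreted as the start of an escape too.

```python
def unescape_doubled(text):
    """Turn literal backslash-u sequences in text into the characters they name."""
    # Step 1: escape everything. Real non-ASCII characters become \uXXXX
    # and each literal backslash becomes two backslashes.
    escaped = text.encode('unicode-escape')
    # Step 2: collapse the doubled backslashes back into single ones.
    collapsed = escaped.replace(b'\\\\', b'\\')
    # Step 3: decode, so every \uXXXX sequence becomes one character.
    return collapsed.decode('unicode-escape')


sample = "\\u201c借\\u201d东风"  # same over-escaped string as above
result = unescape_doubled(sample)
print(result)  # prints “借”东风
```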
Blckknght
  • Thanks, your solution works, and your guess is right: I use https://github.com/hardikvasa/google-images-download to extract image metadata to a JSON file, and I got str_unicode from that JSON file. – GoTop Apr 21 '19 at 06:44
  • @GoTop: I'm glad this answer was useful to you. If you think it is the best answer to your question, please consider [clicking the check mark to the left of it to accept it](https://stackoverflow.com/help/accepted-answer). – Blckknght Apr 21 '19 at 07:28
1

You're not escaping the characters correctly; you have an extra \:

>>> print("\u201c借\u201d东风")
“借”东风
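
To see what the extra backslash changes, you can compare the two literals directly. A quick sketch (the variable names are only for illustration):

```python
one_backslash = "\u201c借\u201d东风"      # Python interprets each escape as one character
two_backslashes = "\\u201c借\\u201d东风"  # backslash, 'u', '2', '0', '1', 'c' are kept literally

print(len(one_backslash))                # 5
print(len(two_backslashes))              # 15
print(one_backslash == two_backslashes)  # False
```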
Francisco
-2

The Unicode standard contains a lot of tables listing characters and their corresponding code points:

0061    'a'; LATIN SMALL LETTER A
0062    'b'; LATIN SMALL LETTER B
0063    'c'; LATIN SMALL LETTER C
...
007B    '{'; LEFT CURLY BRACKET
...
2167    'Ⅷ': ROMAN NUMERAL EIGHT
2168    'Ⅸ': ROMAN NUMERAL NINE
...
265E    '♞': BLACK CHESS KNIGHT
265F    '♟': BLACK CHESS PAWN
...
1F600   '😀': GRINNING FACE
1F609   '😉': WINKING FACE
...

You can find out more in the Python 3 documentation: Unicode Python 3
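
Entries like the ones in this table can also be checked from Python itself with the standard `unicodedata` module. A small sketch; the choice of sample characters is arbitrary:

```python
import unicodedata

# A few of the code points from the table above, written as escapes.
samples = ["a", "{", "\u2167", "\u265e", "\U0001F600"]

# Format each one as: code point in hex, the character, its official name.
lines = [f"{ord(ch):04X}  {ch!r}: {unicodedata.name(ch)}" for ch in samples]
print("\n".join(lines))
```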

Saad Ahmad
  • This doesn't seem to answer the question that was asked at all. Just linking to the Python docs is not nearly good enough. – Blckknght Apr 21 '19 at 06:32
  • I know that \u201c means “ and \u201d means ”, but I need these escapes to print as the right characters in the console, so your answer doesn't help. Thanks anyway. – GoTop Apr 21 '19 at 06:35