In Python 2 there are the string-escape and unicode-escape codecs. For a UTF-8 byte string, string-escape processes backslash escapes and leaves non-ASCII bytes untouched:

>>> "你好\\n".decode('string-escape')
'\xe4\xbd\xa0\xe5\xa5\xbd\n'

However, in Python 3, string-escape was removed. We have to encode the string into bytes and decode it with unicode-escape:

>>> "This\\n".encode('utf_8').decode('unicode_escape')
'This\n'

This works for ASCII input, but non-ASCII bytes are mangled along the way:

>>> "你好\\n".encode('utf_8')
b'\xe4\xbd\xa0\xe5\xa5\xbd\\n'
>>> "你好\\n".encode('utf_8').decode('unicode_escape').encode('utf_8')
b'\xc3\xa4\xc2\xbd\xc2\xa0\xc3\xa5\xc2\xa5\xc2\xbd\n'

Every non-ASCII byte is decoded as a separate Latin-1 character, so re-encoding to UTF-8 yields mojibake instead of the original text.
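To make the failure concrete, here is a small sketch (the variable names are only for illustration) showing that unicode_escape treats each non-escape byte the same way a Latin-1 decode would, which is why the two UTF-8 encoded characters turn into six separate ones:

```python
# The UTF-8 bytes of the text, followed by a literal backslash and 'n'.
raw = "你好\\n".encode("utf-8")

# unicode_escape processes the backslash escape, but maps every other byte
# to the code point with the same numeric value (a Latin-1 interpretation).
via_escape = raw.decode("unicode_escape")

# Decoding the first six bytes as Latin-1 and appending a newline by hand
# gives the same string.
via_latin1 = raw[:-2].decode("latin-1") + "\n"

print(via_escape == via_latin1)
print(len(via_escape))
```

Because the damage is just a Latin-1 misreading, encoding the result back with Latin-1 recovers the original UTF-8 bytes in this particular case.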

So is there a solution for this? Is it possible in Python 3 to keep all non-ASCII bytes while decoding all escape sequences?

Ning Sun

1 Answer

import codecs
codecs.getdecoder('unicode_escape')('你好\\n')
raylu
  • This does not work; the decoder will implicitly encode the string to UTF-8 before applying the decoding, and then the decoding translates the UTF-8 bytes into individual characters. I get `('ä½\xa0å¥½\n', 8)` from the above code. – Karl Knechtel Aug 04 '22 at 22:07
  • Huh, you're right, that no longer works. I'm not sure what changed, and I can't find a solution that works now either. https://github.com/python/cpython/issues/65530 seems to explain the issue but not when it broke. https://stackoverflow.com/q/48908131/385891 is only valid for ASCII now (I think) – raylu Oct 23 '22 at 01:57
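For what it's worth, here is a minimal sketch of two approaches that appear to handle the example from the question in Python 3. The helper names `unescape_str` and `unescape_bytes` are hypothetical, and `codecs.escape_decode` is an undocumented CPython function, so relying on it is an assumption rather than a supported API.

```python
import codecs


def unescape_str(text: str) -> str:
    """Hypothetical helper: process backslash escapes in a str, keeping non-ASCII text."""
    # Characters outside Latin-1 are rewritten as \uXXXX / \UXXXXXXXX escapes by
    # the backslashreplace error handler; Latin-1 characters become single bytes,
    # which unicode_escape decodes back to the same characters.
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")


def unescape_bytes(data: bytes) -> bytes:
    """Hypothetical helper: bytes-to-bytes unescaping, similar in spirit to string-escape."""
    # Assumption: codecs.escape_decode is undocumented and may change between versions.
    # It returns a (decoded_bytes, consumed_length) tuple.
    return codecs.escape_decode(data)[0]


print(unescape_str("你好\\n") == "你好\n")
print(unescape_bytes("你好\\n".encode("utf-8")) == "你好\n".encode("utf-8"))
```

One caveat with the str variant: if a lone backslash sits directly before a character outside Latin-1, the generated `\uXXXX` escape ends up behind an escaped backslash and is left as literal text, so input like that would need separate handling.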