
In Python 3, suppose I have

>>> thai_string = 'สีเ'

Using encode gives

>>> thai_string.encode('utf-8')
b'\xe0\xb8\xaa\xe0\xb8\xb5\xe0\xb9\x80'

My question: how can I get encode() to return a bytes sequence using \u instead of \x? And how can I decode them back to a Python 3 str type?

I tried using the ascii builtin, which gives

>>> ascii(thai_string)
"'\\u0e2a\\u0e35\\u0e40'"

But this doesn't seem quite right, as I can't decode it back to obtain thai_string.

Python documentation tells me that

  • \xhh escapes the character with the hex value hh while
  • \uxxxx escapes the character with the 16-bit hex value xxxx

The documentation says that \u is only used in string literals, but I'm not sure what that means. Is this a hint that my question has a flawed premise?

Michael Currie
  • What about `.decode('utf-8')`? Aren't strings in Python unicode anyway? – Zizouz212 Aug 28 '15 at 22:45
  • @Zizouz212, neither `thai_string` nor `ascii(thai_string)` have a `decode` method, and `thai_string.encode('utf-8').decode('utf-8')` brings me back to where I started, `thai_string`, which is not the desired output. – Michael Currie Aug 28 '15 at 23:00
  • Python documentation relevant to the escape sequence `\u`: https://docs.python.org/3/reference/lexical_analysis.html and https://docs.python.org/3/library/codecs.html#encodings-and-unicode – 0 _ Apr 08 '21 at 02:56
  • Relevant: https://stackoverflow.com/q/1347791/1959808 – 0 _ Apr 08 '21 at 03:01
  • Does this answer your question? [How to work with surrogate pairs in Python?](https://stackoverflow.com/questions/38147259/how-to-work-with-surrogate-pairs-in-python) – ti7 Jul 26 '21 at 15:45
  • I also use `ascii(sku).replace(r"\x", r"\u00")`, and it works better – Felipe Buccioni Jul 27 '21 at 22:48
  • @FelipeBuccioni That code corrupts strings that contain a backslash followed by a literal x. – benrg Nov 04 '21 at 16:53

1 Answer


You can use unicode_escape:

>>> thai_string.encode('unicode_escape')
b'\\u0e2a\\u0e35\\u0e40'

Note that encode() always returns a byte string (bytes). The unicode_escape encoding is documented as being intended to:

Produce a string that is suitable as Unicode literal in Python source code
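
The question also asks how to get back to a str. A minimal sketch, assuming the bytes were produced by the same codec: decoding with unicode_escape reverses the escaping. The variable names escaped and restored are illustrative only, and the sample string is written with \u escapes so the snippet does not depend on the file's encoding.

```python
# The question's sample string 'สีเ', spelled out as three code points.
thai_string = '\u0e2a\u0e35\u0e40'

# str -> bytes: each non-ASCII code point becomes a literal backslash-u sequence.
escaped = thai_string.encode('unicode_escape')

# bytes -> str: the same codec turns the escape sequences back into characters.
restored = escaped.decode('unicode_escape')

print(escaped)                  # b'\\u0e2a\\u0e35\\u0e40'
print(restored == thai_string)  # True
```

One caveat: when decoding, unicode_escape treats any byte that is not part of an escape sequence as Latin-1, so it is best suited to bytes that came from encode('unicode_escape') rather than to arbitrary UTF-8 data.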

Simeon Visser
  • Perfect. But why does this string have two slashes before the "u" while the "x" only has one? – Michael Currie Aug 31 '15 at 03:26
  • This is simply how Python displays a literal backslash inside a quoted string. Compare `'\\n'` (literal backslash, literal `n`) to `'\n'` (newline character). – tripleee Nov 04 '21 at 15:27
  • If you want the result as a string, you can tack on `.decode('ascii')` – tripleee Nov 04 '21 at 15:28
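
Both of the comments above can be checked directly. The sketch below (the variable name text is illustrative) decodes the escaped bytes as ASCII to get a str, then compares how repr() and print() show it: the doubled backslash appears only in the repr, and the data contains a single backslash per escape.

```python
# The question's sample string 'สีเ', spelled out as three code points.
thai_string = '\u0e2a\u0e35\u0e40'

# The escaped bytes are pure ASCII, so decoding as ASCII yields a str
# that contains literal backslashes rather than Thai characters.
text = thai_string.encode('unicode_escape').decode('ascii')

print(repr(text))        # '\\u0e2a\\u0e35\\u0e40'  (repr doubles each backslash)
print(text)              # \u0e2a\u0e35\u0e40       (the actual characters)
print(len(text))         # 18: three escapes of six characters each
print(text.count('\\'))  # 3: one real backslash per escape
```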