How to encode a Python 3 string using \u escape codes?

Question

In Python 3, suppose I have

>>> thai_string = 'สีเ'

Using encode gives

>>> thai_string.encode('utf-8')
b'\xe0\xb8\xaa\xe0\xb8\xb5\xe0\xb9\x80'

My question: how can I get encode() to return a bytes sequence using \u instead of \x? And how can I decode them back to a Python 3 str type?

I tried using the ascii builtin, which gives

>>> ascii(thai_string)
"'\\u0e2a\\u0e35\\u0e40'"

But this doesn't seem quite right, as I can't decode it back to obtain thai_string.

Python documentation tells me that

\xhh escapes the character with the hex value hh while
\uxxxx escapes the character with the 16-bit hex value xxxx

The documentation says that \u is only used in string literals, but I'm not sure what that means. Is this a hint that my question has a flawed premise?

What about `.decode('utf-8')`? Aren't strings in Python unicode anyway? — Zizouz212, Aug 28 '15 at 22:45
@Zizouz212, neither `thai_string` nor `ascii(thai_string)` has a `decode` method, and `thai_string.encode('utf-8').decode('utf-8')` brings me back to where I started, `thai_string`, which is not the desired output. — Michael Currie, Aug 28 '15 at 23:00
Python documentation relevant to the escape sequence `\u`: https://docs.python.org/3/reference/lexical_analysis.html and https://docs.python.org/3/library/codecs.html#encodings-and-unicode — 0 _, Apr 08 '21 at 02:56
Does this answer your question? [How to work with surrogate pairs in Python?](https://stackoverflow.com/questions/38147259/how-to-work-with-surrogate-pairs-in-python) — ti7, Jul 26 '21 at 15:45
I also use `ascii(sku).replace(r"\x", r"\u00")` and it works better — Felipe Buccioni, Jul 27 '21 at 22:48
@FelipeBuccioni That code corrupts strings that contain a backslash followed by a literal x. — benrg, Nov 04 '21 at 16:53
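
The corruption described in the comment above can be reproduced with a short sketch. The helper name `naive_escape` and the sample strings are made up for illustration; `ast.literal_eval` is used only to turn the quoted result back into a `str` so the two cases can be compared.

```python
import ast

def naive_escape(s):
    # Hypothetical helper wrapping the replace-based approach from the comment.
    return ascii(s).replace(r"\x", r"\u00")

# Case 1: a non-ASCII character below U+0100. ascii() emits \xe9,
# and the replace turns it into \u00e9, which still means the same character.
accented = naive_escape('caf\xe9')
print(accented)                      # 'caf\u00e9'
print(ast.literal_eval(accented) == 'caf\xe9')   # True

# Case 2: the input itself contains a literal backslash followed by x.
# ascii() doubles the backslash, and the replace then rewrites the "\x" part.
original = 'C:\\xray'
path = naive_escape(original)
print(path)                          # 'C:\\u00ray'
print(ast.literal_eval(path) == original)        # False: the text was changed

# The unicode_escape codec round-trips the same input unchanged.
round_tripped = original.encode('unicode_escape').decode('unicode_escape')
print(round_tripped == original)     # True
```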

score 12 · Accepted Answer · answered Aug 28 '15 at 22:46

You can use unicode_escape:

>>> thai_string.encode('unicode_escape')
b'\\u0e2a\\u0e35\\u0e40'

Note that encode() always returns a bytes object, and that the unicode_escape encoding is intended to:

Produce a string that is suitable as Unicode literal in Python source code
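
The question also asks how to get back to a `str`. A minimal sketch of the round trip, assuming the bytes were produced by `unicode_escape` in the first place (so they are pure ASCII); the variable names `escaped` and `restored` are only illustrative:

```python
# Same three characters as 'สีเ' in the question, written with \u escapes
# so the example does not depend on the source file's encoding.
thai_string = '\u0e2a\u0e35\u0e40'

# str -> bytes: every non-ASCII character becomes a literal backslash-u escape.
escaped = thai_string.encode('unicode_escape')
print(escaped)    # b'\\u0e2a\\u0e35\\u0e40'

# bytes -> str: decoding with the same codec reverses the escaping.
restored = escaped.decode('unicode_escape')
print(restored == thai_string)    # True
```

Decoding arbitrary bytes with `unicode_escape` treats any non-escaped byte as Latin-1, so this is best kept to data that was encoded the same way.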

Simeon Visser


Perfect. But why does this string have two slashes before the "u" while the "x" only has one? – Michael Currie Aug 31 '15 at 03:26
This is simply how Python displays a literal backslash inside a quoted string. Compare `'\\n'` (literal backslash, literal `n`) to `'\n'` (newline character). – tripleee Nov 04 '21 at 15:27
If you want the result as a string, you can tack on `.decode('ascii')` – tripleee Nov 04 '21 at 15:28
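
A small sketch of the two points made in these comments: the doubled backslash only appears in the `repr`, and `.decode('ascii')` gives the escaped text as a `str`. The variable names are illustrative only.

```python
escaped = '\u0e2a\u0e35\u0e40'.encode('unicode_escape')

# The escaped bytes are plain ASCII, so they can be decoded as such.
text = escaped.decode('ascii')

print(repr(text))   # '\\u0e2a\\u0e35\\u0e40'  (repr doubles each backslash)
print(text)         # \u0e2a\u0e35\u0e40      (the actual characters)

# Each escape is six characters: one backslash, 'u', and four hex digits.
print(len(text))    # 18

# In a string literal, by contrast, each \uXXXX is a single character.
print(len('\u0e2a\u0e35\u0e40'))   # 3
```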
