Python unicode strings

Question

I'm a Python newbie and I'm trying to write a script that writes some strings to a file if they differ from what is already there. The problem is that the original string has some characters in \uNNNN Unicode escape format, and I cannot convert the new string to the same format.

The original string I'm trying to compare: \u00A1 ATENCI\u00D3N! \u25C4

New string is received as: ¡ ATENCIÓN! ◄

And this is the code:

str = u'¡ ATENCIÓN! ◄'
print(str)
str1 = str.encode('unicode_escape')
print (str1)
str2 = str1.decode()
print (str2)

And the result is:

¡ ATENCIÓN! ◄
b'\\xa1 ATENCI\\xd3N! \\u25c4'
\xa1 ATENCI\xd3N! \u25c4

So, how can I get \xa1 ATENCI\xd3N! \u25c4 converted to \u00A1 ATENCI\u00D3N! \u25C4 as this is the only Unicode format I can save?

Note: The case of the characters in the strings also needs to match for the comparison.

In Python 3, all strings are Unicode by default, so your first line can simply be `str = '¡ ATENCIÓN! ◄'`. I'm not sure what you're trying to do with the other lines. — MattDMo, Sep 05 '21 at 20:33
Thanks for the first-line tip! I'm trying to convert the new string to be in the same format as the old one. — user16837493, Sep 05 '21 at 20:44
Note that `str` is a built-in type, so it's highly discouraged (and confusing) to use it as a variable name. You could break code if you would do this in a larger application. — wovano, Sep 06 '21 at 12:28
Why do you encode using `unicode_escape` and decode using `utf-8` (and expect it to result in the original string)? If you use the same encoding for both, you should get the same result. If you use different encodings, it's logical you get a different result. — wovano, Sep 06 '21 at 12:31

MattDMo · Accepted Answer · 2021-09-07T13:34:36.747

-1

The issue is that, according to the docs (the lexical analysis section on string and bytes literals, between the two escape sequence tables), the \u, \U, and \N Unicode escape sequences are only recognized in string literals. That means that once the literal has been evaluated, such as in a variable assignment:

s = "\u00A1 ATENCI\u00D3N! \u25C4"

encoding it with str.encode("unicode_escape") produces a bytes object that uses the shorter \x form where it can (for code points below U+0100):

b'\\xa1 ATENCI\\xd3N! \\u25c4'

Using

b'\\xa1 ATENCI\\xd3N! \\u25c4'.decode("unicode_escape")

will convert it back to '¡ ATENCIÓN! ◄'. This uses the actual (intended) representation of the characters, and not the \uXXXX escape sequences of the original string s.
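The round trip described above can be checked directly. The sketch below also covers the case where the old string is read from a file as plain text containing literal backslash-u sequences; the variable names are illustrative, and the raw string stands in for file contents that the question does not show.

```python
# The literal is evaluated to the real characters; the escapes are gone.
text = "\u00A1 ATENCI\u00D3N! \u25C4"

# unicode_escape picks \xNN below U+0100 and \uNNNN above that.
encoded = text.encode("unicode_escape")
print(encoded)

# Decoding with the same codec restores the original string.
decoded = encoded.decode("unicode_escape")
print(decoded == text)

# Assumption: the old string was read from a file as text with literal
# backslashes, which a raw string imitates here.
from_file = r"\u00A1 ATENCI\u00D3N! \u25C4"
restored = from_file.encode("ascii").decode("unicode_escape")
print(restored == "¡ ATENCIÓN! ◄")
```

Decoding the stored text once and comparing real characters avoids having to match the exact escape style or hex digit case.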

So, what you should do is not mess around with encoding and decoding things. Observe:

print("\u00A1 ATENCI\u00D3N! \u25C4" == '¡ ATENCIÓN! ◄')
True

That's all the comparison you need to do.
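If the file really has to keep the \uXXXX text form with uppercase hex digits, as the question states, one option is to escape the new string yourself before writing it. This is only a minimal sketch: escape_non_ascii_upper is a hypothetical helper name, not a standard-library function, and writing code points above U+FFFF as UTF-16 surrogate pairs is an assumption borrowed from the Java .properties convention.

```python
def escape_non_ascii_upper(text: str) -> str:
    """Hypothetical helper: write non-ASCII characters as \\uXXXX (uppercase hex)."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # Plain ASCII is kept unchanged.
            parts.append(ch)
        elif cp <= 0xFFFF:
            parts.append(f"\\u{cp:04X}")
        else:
            # Assumption: characters beyond the BMP become a surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            parts.append(f"\\u{high:04X}\\u{low:04X}")
    return "".join(parts)


new_text = "¡ ATENCIÓN! ◄"
stored_text = r"\u00A1 ATENCI\u00D3N! \u25C4"  # text as it appears in the file

escaped = escape_non_ascii_upper(new_text)
print(escaped)
print(escaped == stored_text)
```

Comparing the decoded strings is usually simpler; this escaped form is only needed at the point where the text is written out.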

For further reading, you may be interested in:

How to work with surrogate pairs in Python?
Encodings and Unicode from the Python docs.

edited Sep 07 '21 at 13:34

answered Sep 05 '21 at 21:12

MattDMo


Thanks for the clarification! Makes sense. Anyway, found a workaround using a method `escape_non_ascii` from https://github.com/Tblue/python-jproperties. That way I can get exactly what I want by printing `escape_non_ascii(str)` – user16837493 Sep 06 '21 at 07:15
There is so much wrong with this answer, but I'll try to explain: – wovano Sep 06 '21 at 17:19
(1) What do you mean by "Unicode escape characters"? All characters are Unicode, so there is no reason to "escape Unicode characters". If you are referring to the "\u00A1" form, that's just a way Unicode characters are represented in Python. – wovano Sep 06 '21 at 17:19
(2) Where is it stated that they are "only allowed in string literals"? And if that's the case, when are they not allowed? Could you give an example? – wovano Sep 06 '21 at 17:20
(3) "any attempt to `str.encode()` automatically converts it to a bytes literal". No, it converts it to `bytes`, that's what it does. The '\xa1' is (again) just the way the numbers are represented in Python if they are not standard human-readable characters. In the case of the question, the `'unicode_escape'` "encoding" is used, which explicitly converts the Unicode character `'\u00A1'` (which is a string with length 1) to the byte sequence `['\\', 'x', 'a', '1']` (which has length 4). – wovano Sep 06 '21 at 17:20
(4) "There is no way to convert \xa1 back to \u00a1". Of course there is. Just decode using the same encoding: `str1.decode('unicode_escape')` works. If you're talking about the character `'\xa1'`, then even `b'\xa1'.decode('latin-1')` would work. – wovano Sep 06 '21 at 17:21
@wovano many thanks for all the feedback. I'll try to incorporate it into my answer. As far as the "*only allowed in string literals*", I took that from [here](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals), between the two escape sequence tables. I misquoted; it should be "only *recognized* in string literals". – MattDMo Sep 07 '21 at 13:19
