python: difference between \x and \u in characters

Question

I am trying to understand Unicode and byte representations using hex, so I tried the following in Python.

> 'I am a string Ņ'.encode('utf-8')
> b'I am a string \xc5\x85'

Here I see \xc5\x85. What does this representation mean? The actual Unicode code point for Ņ is \u0145.

How is \u0145 = \xc5\x85?

The bytes `C5 85` are the UTF-8 encoding of the Unicode code point `U+0145`. They are not equal: one is an encoding of the code point, the other is the code point itself. — metatoaster, Feb 03 '20 at 05:07
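A minimal sketch of how `U+0145` turns into `C5 85`, assuming the standard two-byte UTF-8 layout (`110xxxxx 10xxxxxx`) that covers `U+0080` through `U+07FF`. The variable names are illustrative only; in practice `str.encode('utf-8')` does this for you.

```python
# Code point of the character, written with a \u escape so the
# source file itself stays pure ASCII.
code_point = ord("\u0145")          # 0x145

# Two-byte UTF-8 form: 110xxxxx 10xxxxxx
lead = 0xC0 | (code_point >> 6)     # top 5 bits go into the lead byte
cont = 0x80 | (code_point & 0x3F)   # low 6 bits go into the continuation byte
manual = bytes([lead, cont])

print(hex(lead), hex(cont))                  # 0xc5 0x85
print(manual == "\u0145".encode("utf-8"))    # True
```

The `\xc5\x85` in the output is just how `repr` displays bytes outside the printable ASCII range: one `\xNN` escape per byte.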
Does this answer your question? [Python 3 - Encode/Decode vs Bytes/Str](https://stackoverflow.com/questions/14472650/python-3-encode-decode-vs-bytes-str) — metatoaster, Feb 03 '20 at 05:09
Instead of `b'I am a string \xc5\x85'`, I expected it to show as `b'I am a string \u0145'`. — Santhosh, Feb 03 '20 at 05:10
No, in Python 3, the leading `b` prefix for a string literal turns that string into a `bytes` object, which is not a `str`. You cannot put unicode codepoints (which may be written as `\uXXXX`, X being a hexadecimal digit) inside a `bytes` literal. — metatoaster, Feb 03 '20 at 05:12
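A short sketch of the `str` versus `bytes` distinction described here, assuming Python 3. The `compile` call is only a way to check, without crashing the script, that a `bytes` literal containing a non-ASCII character is rejected.

```python
text = "\u0145"                 # a str holding one code point
data = text.encode("utf-8")     # a bytes object holding its UTF-8 encoding

print(type(text).__name__, len(text))   # str 1
print(type(data).__name__, len(data))   # bytes 2
print(data.decode("utf-8") == text)     # True

# A bytes literal may contain only ASCII characters, so source code
# like b'<non-ASCII char>' is refused by the compiler.
try:
    compile("b'\u0145'", "<demo>", "eval")
    non_ascii_allowed = True
except SyntaxError:
    non_ascii_allowed = False
print(non_ascii_allowed)                # False
```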
https://stackoverflow.com/questions/47116818/ascii-code-point-vs-character-encoding - Looks nearer to my question. My question is related to code point vs encoding. — Santhosh, Feb 03 '20 at 05:15
Python lets programmers write `b"I am a string"` simply because every character enclosed by the double quotes has a standard encoding in the ASCII codec, which maps it directly to a single byte. In reality, you should read that construct as `b"\x49\x20\x61\x6d\x20\x61\x20\x73\x74\x72\x69\x6e\x67"`. — metatoaster, Feb 03 '20 at 05:18
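The equivalence of the two spellings can be checked directly. This is a small sketch with illustrative variable names:

```python
plain = b"I am a string"
escaped = b"\x49\x20\x61\x6d\x20\x61\x20\x73\x74\x72\x69\x6e\x67"

# Both literals produce the same sequence of byte values.
print(plain == escaped)                       # True

# Each element of a bytes object is an integer in range(256).
hex_view = " ".join(f"{b:02x}" for b in plain)
print(hex_view)    # 49 20 61 6d 20 61 20 73 74 72 69 6e 67
```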
`'I am a string Ņ'.encode('unicode_escape')` shows `b'I am a string \\u0145'` — Santhosh, Feb 03 '20 at 05:21
So it's an internal mapping between code points and their encoding. I thought the encoded bytes would be the same as the code point. — Santhosh, Feb 03 '20 at 05:23
The `'\\'` in `b'I am a string \\u0145'` forms a literal backslash character ` \ `, as that is the complete escape sequence, so what you are in fact getting is the encoded form of what `repr` might generate for the original `str`. As I noted before, you need to understand that `str` is not `bytes`, and your assumption that one converts naively into the other is incorrect. — metatoaster, Feb 03 '20 at 05:28
To be absolutely pedantic, you have to look at each individual character separately. One way to do so is to convert each byte to a one-character `str` and collect the results in a `list` (i.e. `[chr(i) for i in b'I am a string \\u0145']`). You will then see this: `['I', ' ', 'a', 'm', ' ', 'a', ' ', 's', 't', 'r', 'i', 'n', 'g', ' ', '\\', 'u', '0', '1', '4', '5']`. Note how the `Ņ` character is not actually present in that output, only the 6-byte-long `\u`-prefixed escape sequence. — metatoaster, Feb 03 '20 at 05:34
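To make the contrast concrete, here is a sketch comparing the `unicode_escape` codec with UTF-8 on the same string. The variable names are illustrative, and the input is written with a `\u` escape so the example does not depend on the source file's encoding.

```python
text = "I am a string \u0145"

escaped = text.encode("unicode_escape")   # ASCII text spelling out the escape
utf8 = text.encode("utf-8")               # the actual UTF-8 encoding

# unicode_escape replaces the character with six ASCII bytes: \ u 0 1 4 5
tail = [chr(b) for b in escaped[-6:]]
print(tail)                      # ['\\', 'u', '0', '1', '4', '5']
print(len(escaped), len(utf8))   # 20 16

# Both forms decode back to the same str with their own codec.
print(escaped.decode("unicode_escape") == text)   # True
print(utf8.decode("utf-8") == text)               # True
```

The two results differ because they are different encodings of the same code point: one is a textual escape sequence made of ASCII bytes, the other is the two-byte UTF-8 form `C5 85`.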
[This would be a much more basic SO question on this topic that is programming language agnostic](https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme) — metatoaster, Feb 03 '20 at 05:36
