
I am trying to understand Unicode and byte representation using hex. In Python I tried the following:

> 'I am a string Ņ'.encode('utf-8')
> b'I am a string \xc5\x85'

Here I see `\xc5\x85`. What is the meaning of this representation? The actual Unicode code point for Ņ is `\u0145`.

How does `\u0145` become `\xc5\x85`?

Santhosh
  • The bytes `C5 85` are the UTF-8 encoding of the Unicode code point `U+0145`. They are not equal: one is an encoding of the code point, the other is the code point itself. – metatoaster Feb 03 '20 at 05:07
  • Does this answer your question? [Python 3 - Encode/Decode vs Bytes/Str](https://stackoverflow.com/questions/14472650/python-3-encode-decode-vs-bytes-str) – metatoaster Feb 03 '20 at 05:09
  • Instead of `b'I am a string \xc5\x85'`, I expected it to show `b'I am a string \u0145'`. – Santhosh Feb 03 '20 at 05:10
  • I am not sure; I have to go through it. – Santhosh Feb 03 '20 at 05:12
  • No, in Python 3, the leading `b` prefix for a string literal turns that string into a `bytes` object, which is not a `str`. You cannot put unicode codepoints (which may be written as `\uXXXX`, X being a hexadecimal digit) inside a `bytes` literal. – metatoaster Feb 03 '20 at 05:12
  • But we can put characters like `I am a string` inside it. – Santhosh Feb 03 '20 at 05:13
  • https://stackoverflow.com/questions/47116818/ascii-code-point-vs-character-encoding - Looks nearer to my question. My question is related to code point vs encoding. – Santhosh Feb 03 '20 at 05:15
  • Python lets programmers write `b"I am a string"` simply because every single character enclosed by the double quotes has a standard encoding via the ASCII codec, from which it maps directly to bytes. In actuality you should be reading that construct as `b"\x49\x20\x61\x6d\x20\x61\x20\x73\x74\x72\x69\x6e\x67"`. – metatoaster Feb 03 '20 at 05:18
  • `'I am a string Ņ'.encode('unicode_escape')` shows `b'I am a string \\u0145'` – Santhosh Feb 03 '20 at 05:21
  • So it's a mapping between code points and their encoded bytes. I thought the encoded bytes would be the same as the code point. – Santhosh Feb 03 '20 at 05:23
  • The `'\\'` in `b'I am a string \\u0145'` forms a literal backslash character ` \ ` as that is the complete escape sequence, so what you are in fact getting is the encoded form of what `repr` might generate for the original `str`. As I noted before, you need to understand that `str` is not `bytes`, and your assumption about this naive conversion is incorrect. – metatoaster Feb 03 '20 at 05:28
  • To be absolutely pedantic, you have to look at each individual byte separately. One way to do so is to convert each byte of the output to a one-character `str` and collect the results in a `list` (i.e. `[chr(i) for i in b'I am a string \\u0145']`). You will then see this: `['I', ' ', 'a', 'm', ' ', 'a', ' ', 's', 't', 'r', 'i', 'n', 'g', ' ', '\\', 'u', '0', '1', '4', '5']`. Note how the `Ņ` character is not actually present in that output, only the 6-byte-long `\u`-prefixed escape sequence. – metatoaster Feb 03 '20 at 05:34
  • [This would be a much more basic SO question on this topic that is programming language agnostic](https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme) – metatoaster Feb 03 '20 at 05:36
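The point made in the comments, that `C5 85` is the UTF-8 encoding of `U+0145` rather than the code point itself, can be checked by hand. The following is only a minimal sketch: `utf8_encode_two_byte` is a hypothetical helper written for illustration (it is not part of Python), and it covers only the two-byte UTF-8 range `U+0080` to `U+07FF`, where the layout is `110xxxxx 10yyyyyy`.

```python
def utf8_encode_two_byte(code_point: int) -> bytes:
    """Hypothetical helper: encode a code point in U+0080..U+07FF as UTF-8."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only handles the two-byte UTF-8 range")
    # Two-byte layout: 110xxxxx 10yyyyyy
    # xxxxx  = the bits above the low 6 bits of the code point
    # yyyyyy = the low 6 bits of the code point
    lead = 0b11000000 | (code_point >> 6)
    continuation = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, continuation])


code_point = ord("\u0145")                 # the character Ņ
encoded = utf8_encode_two_byte(code_point)

print(hex(code_point))                     # 0x145
print(encoded)                             # b'\xc5\x85'
print(encoded == "\u0145".encode("utf-8")) # True

# ASCII characters encode to a single byte each, which is why they can be
# written directly inside a bytes literal.
print(b"I am" == bytes([0x49, 0x20, 0x61, 0x6D]))  # True

# unicode_escape produces six ASCII bytes (backslash, 'u', '0', '1', '4', '5'),
# not the character itself.
print(len("\u0145".encode("unicode_escape")))      # 6
```

Here `0x145` is `101000101` in binary: the upper bits `00101` go into the lead byte (`11000101`, i.e. `0xC5`) and the low six bits `000101` go into the continuation byte (`10000101`, i.e. `0x85`).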

0 Answers