I need to get a symbol's code in UTF-8 and print the symbol from that code. ord('Ц') returns 1062 and chr(1062) returns 'Ц', so that part is clear. But when I try to do something similar with bytes('Ц', encoding='utf-8'), it returns b'\xd0\xa6', even though the hex representation of 1062 is 0x426. How does this work? Why does it return two hex numbers, and why are the values of these numbers not equal to 1062?
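
The behaviour described above can be reproduced with a short snippet. This is only a minimal sketch; the variable names (`ch`, `code_point`, `encoded`) are illustrative and not part of the original question.

```python
ch = 'Ц'

# ord() gives the Unicode code point, which is independent of any encoding.
code_point = ord(ch)

# str.encode('utf-8') is equivalent to bytes(ch, encoding='utf-8').
encoded = ch.encode('utf-8')

print(code_point, hex(code_point))   # the code point and its hex form
print(encoded, list(encoded))        # the UTF-8 bytes and their integer values

# Both directions round-trip, but through different functions.
print(chr(code_point), encoded.decode('utf-8'))
```

Here `ord`/`chr` convert between a character and its code point, while `encode`/`decode` convert between text and a byte sequence in a particular encoding.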

Mortasen
  • UTF-8 != HEX... – rdas Oct 16 '19 at 15:32
  • You can learn more about the utf-8 encoding here: https://en.wikipedia.org/wiki/UTF-8 TL;DR: utf-8 is a variable length encoding that won't match up with the hex representation of a number – rdas Oct 16 '19 at 15:35
  • [This answer](https://stackoverflow.com/a/6224384/4739755) also helps explain the difference between bytes and string (character) encodings. – b_c Oct 16 '19 at 15:36
  • UTF-8 is a multibyte encoding. Characters in the range 0 to 127 are encoded using one byte (exactly the same as ASCII). The range 128 to 65535 uses two or three bytes. A fourth byte may be used to encode characters beyond that. – ekhumoro Oct 16 '19 at 16:00
  • The number `1062` is the code in Unicode, but `UTF-8` uses different codes. `byte` means values `0-255`; if something is bigger than 255, then `bytes()` keeps it as two or more values, and in `b'\xd0\xa6'` you have two values. – furas Oct 16 '19 at 16:00
  • @furas Not 255 - `chr(128).encode('utf-8')` -> `b'\xc2\x80'` (two bytes). – ekhumoro Oct 16 '19 at 16:04
  • @ekhumoro Wrong, `bytes([128])` gives `b'\x80'`. But the char with code `128` in Unicode doesn't have code `128` in UTF-8. It has code `49792`, which gives `b'\xc2\x80'` – furas Oct 16 '19 at 16:10
  • @furas All characters in the range 128-255 are encoded using two bytes in UTF-8. The byte-sequence `\x80` is not valid UTF-8 (try `b'\x80'.decode('utf-8')`). The expression `chr(128)` returns a unicode string (i.e. text), not bytes. – ekhumoro Oct 16 '19 at 16:16
  • @ekhumoro But you didn't understand my previous comment. I was talking about `byte` and `bytes`, and why bytes displays values bigger than 255 as two numbers and not as one number like `0x426`. It had nothing to do with `utf-8`, which is a different problem. – furas Oct 16 '19 at 16:20
  • @furas But UTF-8 will produce *two bytes* for characters in the range 128-255, which are all less than 256 - so your point is not entirely valid. Note that the OP is explicitly encoding the bytes as UTF-8. – ekhumoro Oct 16 '19 at 16:26
  • @ekhumoro But I was talking about pure `byte` and `bytes`, not about `bytes` created by `utf`. BTW: when you encode to UTF, it replaces `128` with `49792` and later converts that to bytes - so `bytes` never uses the value 128. – furas Oct 16 '19 at 16:33
  • @furas You said: "if something is bigger than 255 then bytes() keeps it as two or more values". This is not true - it depends on the encoding. There is no such thing as "pure bytes" when using `bytes()` with text strings. – ekhumoro Oct 16 '19 at 16:39
  • @ekhumoro - You still don't understand that bytes gets `49792`, not `128`, and it displays `49792` as hex. bytes doesn't convert the value `128` to `49792`. If another encoding converts `128` to `128`, then bytes will get `128` and display it as hex. – furas Oct 16 '19 at 16:42
  • @ekhumoro `bytes()` is a way to keep values as bytes - it doesn't have to be used to encode strings. You can create e.g. `bytes([128, 255])` (`b'\x80\xff'`) and it has nothing to do with encoding. This is what I called "pure bytes" – furas Oct 16 '19 at 16:47
  • @furas This is about text strings (unicode), not byte arrays. The number of hex values in the output depends on the encoding. Thus `chr(169).encode('utf-8')` -> `b'\xc2\xa9'` (two bytes), but `chr(169).encode('latin1')` -> `b'\xa9'` (one byte). So this is about multi-byte versus single-byte encodings of non-ascii text strings. – ekhumoro Oct 16 '19 at 16:57
  • @ekhumoro This is a place for comments, and comments can be more or less connected to the code. – furas Oct 16 '19 at 17:04
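
The point made in the comments, that UTF-8 is a variable-length encoding and that the number of bytes depends on the encoding chosen, can be sketched by building the two bytes for 'Ц' by hand. The helper `encode_two_byte` below is hypothetical, written only for illustration; it covers just the two-byte range (U+0080 to U+07FF), where the code point's bits are split across a lead byte of the form `110xxxxx` and a continuation byte of the form `10xxxxxx`.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Hand-encode a code point in U+0080..U+07FF as UTF-8 (illustrative only)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only handles the two-byte range")
    lead = 0b11000000 | (code_point >> 6)          # 110xxxxx: upper 5 bits
    cont = 0b10000000 | (code_point & 0b00111111)  # 10xxxxxx: lower 6 bits
    return bytes([lead, cont])


# 1062 == 0x426 == 0b10000_100110, so the bytes are 0xD0 and 0xA6.
manual = encode_two_byte(ord('Ц'))
print(manual, manual == 'Ц'.encode('utf-8'))

# Reversing the bit packing recovers the original code point.
recovered = ((manual[0] & 0b00011111) << 6) | (manual[1] & 0b00111111)
print(recovered, chr(recovered))

# The same character can take a different number of bytes in another encoding.
print(chr(169).encode('utf-8'), chr(169).encode('latin1'))
```

So `b'\xd0\xa6'` is not the number 0x426 written in hex; it is the 11 bits of 0x426 distributed over two bytes, each carrying UTF-8 marker bits.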

0 Answers