
I have a list of hex strings that I would like to transform into a list of Unicode characters. Everything here is done with Python 3.5.

If I do print(bytes.fromhex('hex_number').decode('utf-8')) it works. But it does not work if, after the conversion, I store the chars in a list again:

a = ['0063']  # the hex of the Unicode codepoint for the 'c' char
b = [bytes.fromhex(h).decode('utf-8') for h in a]
print(b)

will print

['\x00c']

instead of

['c']

while the code

a = ['0063']
for h in a:
    print(bytes.fromhex(h).decode('utf-8'))

prints, as expected:

c

Can someone explain to me how I can convert the list ['0063'] into the list ['c'], and why I get this (to me) strange behavior?

To see what the 0063 hex corresponds to, look here.

Martijn Pieters
Riccardo Petraglia
  • Why would `0063`, decoded as UTF-8, *ever* produce `'c'`? And why would `030C` map to a space (which encodes to `20` in UTF-8 hex)? – Martijn Pieters Oct 09 '17 at 08:14
  • I can't figure out what codec you are thinking of here. U+030C maps to the COMBINING CARON codepoint in the Unicode standard, for example. – Martijn Pieters Oct 09 '17 at 08:17
  • @MartijnPieters `0063` in hex corresponds to the 'c' in utf-8 (it would be U+0063); this is easy to see with the code above. The `030C` corresponds to the COMBINING CARON, as you said. As I said in the question, it is shown as a space in my shell (probably because my shell cannot map it to anything). Honestly, I do not understand what is wrong with my question. I did not pay much attention to the COMBINING CARON because it was not really important for answering the question, but if you prefer, I can use something different that my shell can map easily. – Riccardo Petraglia Oct 09 '17 at 08:29
  • @MartijnPieters I think it should be clearer now, based on your comments. Otherwise, just let me know. – Riccardo Petraglia Oct 09 '17 at 08:41
  • Right, you appear to have confused *Unicode codepoints* with UTF-8. U+0063 LATIN SMALL LETTER C is `63` in UTF-8, while U+030C COMBINING CARON is `CC8C`. Unicode codepoints != UTF-8. Perhaps you are thinking of UTF-16 (big endian order) instead? – Martijn Pieters Oct 09 '17 at 08:58
  • Note that the link you included in your question *includes encoding examples*. Look closely at the UTF-8 and UTF-16 examples in the *Representations* section. – Martijn Pieters Oct 09 '17 at 09:05
  • It's a pity you don't have more data to show us. The CC8C might have been another clue, but *more data* from your actual usecase would have been helpful in identifying what codec you really have. It is not UTF-8, at any rate. – Martijn Pieters Oct 09 '17 at 09:18
  • The basic idea is that I want to remap some non-ASCII chars to ASCII (there is a piece of software whose code I cannot access and that works only with ASCII, so I need some basic mapping). I was just playing with the "decomposition mapping" field on the codepoints.net website. The chars there are stored as Unicode codepoints that I was trying to convert to a "normal" string. – Riccardo Petraglia Oct 09 '17 at 09:24
  • That'd be a whole new can of worms. See [Python: Convert Unicode to ASCII without errors](//stackoverflow.com/q/2365411) – Martijn Pieters Oct 09 '17 at 09:28
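The decomposition-based ASCII remapping discussed in these comments can be sketched with the standard `unicodedata` module. This is not code from the thread; `to_ascii` is a hypothetical helper, and the mapping is lossy by design (combining marks are simply dropped):

```python
import unicodedata

# Hypothetical helper: NFKD splits accented characters into a base
# letter plus combining marks; encoding with errors='ignore' then
# drops everything that has no ASCII representation. Best-effort only.
def to_ascii(text):
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("\u010das"))  # 'č' (c + COMBINING CARON) becomes plain 'c'
```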

2 Answers


You don't have UTF-8 data if 0063 is U+0063 LATIN SMALL LETTER C. At best you have UTF-16 data, in big-endian order:

>>> bytes.fromhex('0063').decode('utf-16-be')
'c'

You may want to check if your full data starts with a Byte Order Mark, for big-endian UTF-16 that'd be 'FEFF' in hex, at which point you can drop the -be suffix as the decoder will know what byte order to use. If your data starts with 'FFFE' instead, you have little-endian encoded UTF-16 and you sliced your data at the wrong point; in that case you took along the '00' byte for the preceding codepoint.
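That BOM check can be sketched as follows; `decode_utf16` is a hypothetical helper, and it assumes the input really is UTF-16 one way or the other:

```python
# Hypothetical helper: pick the UTF-16 byte order based on a BOM.
def decode_utf16(raw: bytes) -> str:
    if raw.startswith(b"\xfe\xff") or raw.startswith(b"\xff\xfe"):
        # A BOM is present; the bare 'utf-16' codec consumes it and
        # selects the right byte order automatically.
        return raw.decode("utf-16")
    # No BOM: assume big-endian, as in the example above.
    return raw.decode("utf-16-be")

print(decode_utf16(bytes.fromhex("0063")))      # c
print(decode_utf16(bytes.fromhex("feff0063")))  # c (big-endian BOM)
```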

UTF-8 is a variable-width encoding. The first 128 codepoints in the Unicode standard (corresponding to the ASCII range) encode to single bytes, mapping directly to the ASCII standard. Codepoints in the Latin-1 range and beyond, up to U+07FF(*) (the next 1920 codepoints), map to two bytes, etc.
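You can see the variable width directly by encoding a few codepoints and looking at the resulting bytes:

```python
# UTF-8 width grows with the codepoint:
print('c'.encode('utf-8').hex())       # 63      (1 byte,  U+0063)
print('\u030c'.encode('utf-8').hex())  # cc8c    (2 bytes, U+030C COMBINING CARON)
print('\u20ac'.encode('utf-8').hex())  # e282ac  (3 bytes, U+20AC EURO SIGN)
```

Note that `cc8c` for U+030C matches the value mentioned in the comments above: the UTF-8 bytes and the codepoint number are different things.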

If your input really was UTF-8, then you really have a \x00 NULL character before that 'c'. Printing a NULL results in no output on many terminals, but you can use cat -v to turn such non-printable characters into caret escape codes:

$ python3 -c "print('\x00c')"
c
$ python3 -c "print('\x00c')" | cat -v
^@c

^@ is the representation for a NULL in the caret notation used by cat.
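The same invisibility explains the list output in the question: printing a list shows each element's repr(), which escapes the NULL, while printing the bare string hides it:

```python
s = bytes.fromhex('0063').decode('utf-8')
print(s)        # looks like just 'c' on most terminals (the NULL is invisible)
print(repr(s))  # '\x00c' - the NULL character is still there
print([s])      # ['\x00c'] - lists display the repr of each element
print(len(s))   # 2, not 1
```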


(*) U+07FF is not currently mapped in Unicode; the last UTF-8 two-byte codepoint currently possible is U+07FA NKO LAJANYALAN.

Martijn Pieters
  • Ok... Maybe I am starting to understand this stuff. Unicode is a set of conventions on how to store the chars in the memory. utf-8 is following those conventions using only 8-bit. When I encode something which require more than 8-bit, that will be encoded using 16-bit and so on (this is done in an automagically way: the official encode is still utf-8). This works in encoding. When I want to decode something, I must know "a-priori" how many bits I am going to use. This means that if I have a non-ascii char I cannot use utf-8 for sure. Is this right? – Riccardo Petraglia Oct 09 '17 at 09:19
  • UTF-8 is one of a set of possible *serialisations* of Unicode text. Unicode is much more than just a bunch of codepoints; those conventions go beyond mere serialisation. UTF-8 can represent everything in Unicode, using a variable number of bytes. UTF-16 and UTF-32 are other serialisations, and they use a fixed number of bytes (2 and 4) per codepoint (where UTF-16 would use 2x 2 bytes for Unicode codepoints outside of the [BMP](https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane), called surrogate pairs). – Martijn Pieters Oct 09 '17 at 09:23
  • 8-bit is not a characterisation to apply here. You need to know, a priori, what serialisation standard (codec) was used. You can do *some* finger-printing: if your data starts with `0000FEFF` or `FFFE0000`, then you can assume, with high probability, that the data uses UTF-32 as the codec, for example. – Martijn Pieters Oct 09 '17 at 09:25
  • Sorry to bother you, but I am learning a lot from this discussion. The idea is that when you go from str->bytes you can use utf-8, because it uses a variable number of bytes, while when you go bytes->str you must know the codec used to create those bytes, otherwise you cannot interpret them. This breaks something I used to believe: that utf-8 is just the most space-saving among all the utf-* but they are interchangeable. – Riccardo Petraglia Oct 09 '17 at 09:32
  • @RiccardoPetraglia: Most codecs are not interchangeable. When you encode from `str` to `bytes`, you made a *conscious choice* to use the UTF-8 codec. You could have picked a different codec too. If you always settle for UTF-8, then you can always use the same codec too. If you don't, you need to record your selected codec somewhere. In XML documents, the first [XML declaration](https://en.wikipedia.org/wiki/XML#Key_terminology) is such a place. In HTML, a [`<meta charset>` tag](https://stackoverflow.com/questions/4696499/meta-charset-utf-8-vs-meta-http-equiv-content-type) is often used. – Martijn Pieters Oct 09 '17 at 09:35
  • @RiccardoPetraglia: in other words, if you are dealing with data from arbitrary sources, look for standard indicators for the codec, including the documented standard for the format. – Martijn Pieters Oct 09 '17 at 09:36
  • @RiccardoPetraglia: I recommend you read up on Unicode and codecs: https://nedbatchelder.com/text/unipain.html and https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ are good starting points. – Martijn Pieters Oct 09 '17 at 09:37
  • Thank you very much for your patience. I had already read something, but apparently it was not enough... :) I will take the time to read what you pointed me to. Thank you again. – Riccardo Petraglia Oct 09 '17 at 09:39
a = ['0063']  # hex of the Unicode codepoint for the 'c' char
b = [chr(int(x, 16)) for x in a]
print(b)


Ahmad Yoosofan
  • @MartijnPieters Just to understand better: is this solution agnostic of the codec used? (Maybe I should ask a new question.) – Riccardo Petraglia Oct 13 '17 at 09:01
  • It works as your question needs, and it works for any Unicode character. You may need to use a different input format instead of an array of hex-number strings. – Ahmad Yoosofan Oct 19 '17 at 17:24
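On the codec question raised in these comments: `chr(int(x, 16))` does not decode bytes at all. It interprets the hex string directly as a Unicode codepoint number, so no codec is involved; for codepoints in the BMP this happens to match big-endian UTF-16 decoding of the same hex digits. A small sketch:

```python
# chr() maps a codepoint number straight to a one-character str,
# with no bytes and no codec in between.
a = ['0063', '030c']
b = [chr(int(x, 16)) for x in a]  # [U+0063, U+030C COMBINING CARON]

# For BMP codepoints this agrees with decoding the same hex as UTF-16-BE:
assert chr(int('0063', 16)) == bytes.fromhex('0063').decode('utf-16-be')
```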