Python unicode code point issues: \xe2\x82\x82 vs. CO\u2082

Question

My program is required to take in inputs, but I am having an issue with subscripts such as CO₂...

So when I use CO₂ as an argument to the function, it seems to be represented as the string 'CO\xe2\x82\x82', which is apparently the string literal?

Further on, I read from a spreadsheet (an xlsx file) using read_excel() from pandas to find entries pertaining to CO₂. I then convert this into a dictionary, but in this case it is represented as 'CO\u2082'.

I use the args from earlier, represented as 'CO\xe2\x82\x82', so it doesn't recognize an entry for 'CO\u2082', which then results in a KeyError.

My question is: what would be a way to convert both of these representations of CO₂ so that I can do look-ups in the dictionary? Thank you for any advice.

There is a normalize() function that might be what you're after, see [this answer](https://stackoverflow.com/a/16467505/2280890) — import random, Sep 14 '22 at 03:36
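
Assuming the comment refers to `unicodedata.normalize()` from the standard library, here is a minimal sketch of what it does with the two strings from the question (the variable names are illustrative). Note that normalization folds the subscript into a plain digit but does not, on its own, repair the '\xe2\x82\x82' form:

```python
import unicodedata

proper = 'CO\u2082'              # CO₂ as a correctly decoded Unicode string
mis_decoded = 'CO\xe2\x82\x82'   # UTF-8 bytes read as one character per byte

# NFKC replaces the subscript two with its compatibility equivalent, the digit 2.
folded = unicodedata.normalize('NFKC', proper)
print(folded)  # CO2

# The mis-decoded string contains no subscript character, so it comes back unchanged.
unchanged = unicodedata.normalize('NFKC', mis_decoded)
print(unchanged == mis_decoded)  # True
```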

Pi Marillion · Answer 1 · 2022-09-16T01:49:03.710

It looks like the input to your function is still in its UTF-8 encoded form, while the value read from the XLSX file is a properly decoded Unicode string.

b'\xe2\x82\x82' is the UTF-8 encoding of Unicode codepoint '\u2082' which is identical to '₂' on Unicode-enabled systems.
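
A quick way to check this relationship yourself (a minimal sketch; the variable names are only illustrative):

```python
# U+2082 (SUBSCRIPT TWO) and b'\xe2\x82\x82' are two views of the same character:
# one decoded code point versus its three-byte UTF-8 encoding.
subscript_two = '\u2082'
utf8_bytes = subscript_two.encode('utf-8')

print(utf8_bytes)                           # b'\xe2\x82\x82'
print(utf8_bytes.decode('utf-8'))           # ₂
print(len(subscript_two), len(utf8_bytes))  # 1 3
```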

Most modern systems are Unicode-enabled, so the most common reason to see the raw UTF-8 encoding is that you are reading bytes data, which is always encoded. You can fix that by decoding it like so:

> data = b'CO\xe2\x82\x82'
> data.decode()
'CO₂'

If the encoded data are somehow in a normal (non-bytes) string, then you can do it by converting the existing string to bytes and then decoding it:

> data = 'CO\xe2\x82\x82'
> bytes(map(ord, data)).decode()
'CO₂'

From @mark-tolonen below, using the latin-1 encoding is functionally identical to bytes(map(ord, data)), but much, much faster:

> data = 'CO\xe2\x82\x82'
> data.encode('latin1').decode()
'CO₂'
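
To tie this back to the dictionary look-up in the question, the fix could be wrapped in a small helper. This is only a sketch: `repair_utf8_mojibake` and the `emissions` dictionary are hypothetical names, and the dictionary stands in for whatever is built from `read_excel()`. Strings that are already correct are returned unchanged.

```python
def repair_utf8_mojibake(text):
    """Decode UTF-8 bytes that ended up one-per-character in a str.

    Hypothetical helper: if the text cannot be round-tripped this way
    (for example, it is already a proper Unicode string), return it as-is.
    """
    try:
        return text.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Stand-in for the dictionary built from the spreadsheet (assumed contents).
emissions = {'CO\u2082': 'carbon dioxide'}

key = repair_utf8_mojibake('CO\xe2\x82\x82')
print(emissions[key])  # carbon dioxide

# Already-correct and plain ASCII strings pass through untouched.
print(repair_utf8_mojibake('CO\u2082') == 'CO\u2082')  # True
print(repair_utf8_mojibake('CO2'))                     # CO2
```

One caveat with this design: a string that legitimately consists of Latin-1 characters which also happen to form valid UTF-8 would be converted too, so it is safest to apply the helper only to inputs known to come from the mis-decoded source.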

FYI, `data.encode('latin1').decode()` is 4x more efficient (via timeit). The latin1 codec maps 1:1 the first 256 Unicode codepoints to byte values. — Mark Tolonen, Sep 14 '22 at 17:04
@MarkTolonen Thanks! That's a great tip, and I'll be using it in the future now. — Pi Marillion, Sep 16 '22 at 01:49
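
The claims in these comments can be checked with a short script. This is a sketch, not the commenter's original benchmark; the iteration count is arbitrary and the timings will vary by machine and Python version, so no expected numbers are shown.

```python
import timeit

# Latin-1 maps code points 0-255 to the byte with the same value,
# which is why it can stand in for bytes(map(ord, ...)).
all_low_chars = ''.join(map(chr, range(256)))
assert all_low_chars.encode('latin1') == bytes(range(256))

data = 'CO\xe2\x82\x82'
via_ord = bytes(map(ord, data)).decode()
via_latin1 = data.encode('latin1').decode()
assert via_ord == via_latin1 == 'CO\u2082'

# Rough timing comparison of the two approaches.
t_ord = timeit.timeit(lambda: bytes(map(ord, data)).decode(), number=100_000)
t_latin1 = timeit.timeit(lambda: data.encode('latin1').decode(), number=100_000)
print(f'map(ord): {t_ord:.3f}s  latin1: {t_latin1:.3f}s')
```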
