My program needs to accept inputs, but I am having an issue with subscripts such as CO₂.

When I pass CO₂ as an argument to the function, it is represented as the string 'CO\xe2\x82\x82', which is apparently the string literal?

Later, I read an xlsx spreadsheet using read_excel() from pandas to find entries pertaining to CO₂. I then convert the result into a dictionary, but there the key is represented as 'CO\u2082'.

I look up the argument from earlier, represented as 'CO\xe2\x82\x82', so it doesn't match the entry for 'CO\u2082', which results in a KeyError.

My question is: how can I convert these two representations of CO₂ to the same form so that I can do look-ups in the dictionary? Thank you for any advice.

Cal
  • There is a normalize() function that might be what you're after, see [this answer](https://stackoverflow.com/a/16467505/2280890) – import random Sep 14 '22 at 03:36

1 Answer

It looks like the input to your function is still UTF-8 encoded, while the value read from the XLSX file has already been decoded to Unicode text.

b'\xe2\x82\x82' is the UTF-8 encoding of the Unicode code point '\u2082' (SUBSCRIPT TWO), which displays as '₂' on Unicode-enabled systems.
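
As a quick check of that relationship, here is a minimal sketch (the variable names are only for illustration):

```python
# U+2082 is SUBSCRIPT TWO; encoding it as UTF-8 yields three bytes.
subscript_two = "\u2082"
encoded = subscript_two.encode("utf-8")

# Decoding those same three bytes gives back the single character.
decoded = b"\xe2\x82\x82".decode("utf-8")

print(encoded)                    # the bytes form seen in the function argument
print(decoded == subscript_two)   # whether both forms are the same character
```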

Most modern systems are Unicode-enabled, so the most common reason to see the UTF-8 byte form is reading bytes data, which is always encoded. You can fix that by decoding it like so:

> data = b'CO\xe2\x82\x82'
> data.decode()
'CO₂'

If the encoded data have somehow ended up in a normal (non-bytes) string, you can convert the existing string to bytes and then decode it:

> data = 'CO\xe2\x82\x82'
> bytes(map(ord, data)).decode()
'CO₂'

As @mark-tolonen points out below, encoding with latin-1 is functionally identical to bytes(map(ord, data)), but much, much faster:

> data = 'CO\xe2\x82\x82'
> data.encode('latin1').decode()
'CO₂'
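
To tie this back to the dictionary look-up in the question, here is a minimal sketch. The helper name `fix_mojibake` and the `emissions` dictionary are hypothetical stand-ins for your own code, and the sketch assumes the argument arrives as a normal string holding UTF-8 bytes as individual characters. Strings that are already correct are returned unchanged, so it should be safe to apply to every key before the look-up.

```python
def fix_mojibake(text: str) -> str:
    """Re-decode a str whose characters are really UTF-8 bytes.

    Returns the text unchanged if it cannot be round-tripped this way
    (for example, when it is already proper Unicode).
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Hypothetical dictionary standing in for the one built from the spreadsheet.
emissions = {"CO\u2082": 1.23}

arg = "CO\xe2\x82\x82"                # the form seen in the function argument
value = emissions[fix_mojibake(arg)]  # look-up now matches the 'CO\u2082' key

# Already-correct input passes through untouched.
same = fix_mojibake("CO\u2082")
```

If the argument is an actual bytes object instead, `arg.decode()` from the first example is all that is needed.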
Pi Marillion