
I am using RoboBrowser (which uses BeautifulSoup) to extract links from a website, and some of these links contain Unicode characters. However, I am having trouble getting Python to interpret them correctly.

For example, a link contains this Cyrillic character

п

Which is URL encoded as

%D0%BF

BeautifulSoup will spit out

u'\xd0\xbf'

Which looks correct to me but prints out

п

which corresponds to the byte array

'c3 90 c2 bf'
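
That byte array can be reproduced with the same hex dump I use further down:

u'\xd0\xbf'.encode("utf-8").encode("hex")
'c390c2bf'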

The correct encoding appears to be

u'\u043f'

Which gives the correct byte array and also prints correctly

u'\u043f'.encode("utf-8").encode("hex")
'd0bf'

I'm guessing I'm doing something wrong, so the question is: how do I get from

u'\xd0\xbf' to u'\u043f'?
  • What code are you using to get it from the page? The URL is correctly encoded; something is decoding the URL-encoded *bytes* as Latin-1 rather than UTF-8. – Martijn Pieters Nov 15 '17 at 10:59
  • You can fix this by using `.encode('latin1').decode('utf8')` (see the sketch after these comments), but I want to see if there is anything that can be done to avoid the issue in the first place. – Martijn Pieters Nov 15 '17 at 11:00
  • `u'\xd0\xbf'.encode('latin1').decode('utf8')` – Stop harming Monica Nov 15 '17 at 11:09
  • Awesome, perfect @martijn-pieters. I am doing something along the lines of `a = domObject.select('a')` followed by `url = urllib.unquote(a.get("href"))`. Also, I don't quite get why u'\xd0\xbf' is wrong; is that not the correct byte sequence? – user3143516 Nov 15 '17 at 11:12
  • Hmm, looks like I've been blaming the wrong component; it is actually a urllib bug, https://bugs.python.org/issue8136, which was deemed too dangerous to fix, and this is now the 'intended' behavior. – user3143516 Nov 15 '17 at 11:39
  • In Python 3, [`urllib.parse.unquote()`](https://docs.python.org/3/library/urllib.parse.html#urllib.parse.unquote) defaults to UTF-8 and is configurable. The Python 2 equivalent just gives you Latin-1, unconditionally. – Martijn Pieters Nov 15 '17 at 11:44
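
To sum up the fixes from the comments, here is a minimal Python 2 sketch (the href literal is a stand-in; in my real code it comes from RoboBrowser via `a.get("href")`):

import urllib

href = u'%D0%BF'  # unicode href, as BeautifulSoup/RoboBrowser returns it

# What I was doing: unquoting the unicode string produces the mojibake,
# because Python 2's unquote turns each %XX into a code point (i.e. Latin-1)
bad = urllib.unquote(href)
print repr(bad)                                   # u'\xd0\xbf'

# Fix after the fact: round-trip through Latin-1, then decode as UTF-8
print repr(bad.encode('latin1').decode('utf8'))   # u'\u043f'

# Avoid the issue: unquote a byte string, then decode the bytes as UTF-8
print repr(urllib.unquote(href.encode('ascii')).decode('utf-8'))  # u'\u043f'

# (In Python 3, urllib.parse.unquote('%D0%BF') defaults to UTF-8 and
# returns u'\u043f', i.e. 'п', directly.)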
