
I am using RoboBrowser (which uses BeautifulSoup) to extract links from a website, and some of these links contain Unicode characters. However, I am having trouble getting Python to interpret them correctly.

For example, a link contains this Cyrillic character

п

Which is URL encoded as

%D0%BF

BeautifulSoup will spit out

u'\xd0\xbf'

Which looks correct to me but prints out

п

which corresponds to the byte array

'c3 90 c2 bf'
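
That byte array can be reproduced with the same hex dump I use further down:

u'\xd0\xbf'.encode("utf-8").encode("hex")
'c390c2bf'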

The correct encoding appears to be

u'\u043f'

Which gives the correct byte array and also prints correctly

u'\u043f'.encode("utf-8").encode("hex")
'd0bf'

I'm guessing I'm doing something wrong, so the question is: how do I get from

u'\xd0\xbf' to u'\u043f'?
  • What code are you using to get it from the page? The URL is correctly encoded; something is decoding the URL-encoded *bytes* as Latin-1 rather than UTF-8. – Martijn Pieters Nov 15 '17 at 10:59
  • You can fix this by using `.encode('latin1').decode('utf8')` (see the sketch after these comments), but I want to see if there is anything that can be done to avoid the issue in the first place. – Martijn Pieters Nov 15 '17 at 11:00
  • `u'\xd0\xbf'.encode('latin1').decode('utf8')` – Stop harming Monica Nov 15 '17 at 11:09
  • Awesome, perfect @martijn-pieters. I am doing something along the lines of `a = domObject.select('a')` followed by `url = urllib.unquote(a.get("href"))`. Also, I don't quite get why u'\xd0\xbf' is wrong; is that not the correct byte sequence? – user3143516 Nov 15 '17 at 11:12
  • Hmm, looks like I've been blaming the wrong component; it is actually a urllib bug, https://bugs.python.org/issue8136, which was deemed too dangerous to fix, and this is now the 'intended' behavior. – user3143516 Nov 15 '17 at 11:39
  • In Python 3, [`urllib.parse.unquote()`](https://docs.python.org/3/library/urllib.parse.html#urllib.parse.unquote) defaults to UTF-8 and is configurable. The Python 2 equivalent just gives you Latin-1, unconditionally. – Martijn Pieters Nov 15 '17 at 11:44
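
To sum up the fixes from the comments, here is a minimal Python 2 sketch (the href literal is a stand-in; in my real code it comes from RoboBrowser via `a.get("href")`):

import urllib

href = u'%D0%BF'  # unicode href, as BeautifulSoup/RoboBrowser returns it

# What I was doing: unquoting the unicode string produces the mojibake,
# because Python 2's unquote turns each %XX into a code point (i.e. Latin-1)
bad = urllib.unquote(href)
print repr(bad)                                   # u'\xd0\xbf'

# Fix after the fact: round-trip through Latin-1, then decode as UTF-8
print repr(bad.encode('latin1').decode('utf8'))   # u'\u043f'

# Avoid the issue: unquote a byte string, then decode the bytes as UTF-8
print repr(urllib.unquote(href.encode('ascii')).decode('utf-8'))  # u'\u043f'

# (In Python 3, urllib.parse.unquote('%D0%BF') defaults to UTF-8 and
# returns u'\u043f', i.e. 'п', directly.)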
