Decoding bytes doesn't seem to decode

Question

While trying to get the html source of an... "academic" site I have trouble with decoding. I am using the requests commands:

import requests

resp = requests.get(url)
print(resp.content)

edit: I did try resp.text

The result is something like this:

"b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xdb\x00C\x00\".

Bytes. Cool. I tried using .decode("format") with various formats mentioned here (iso, latin, utf, cp) but I had no luck.
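Text codecs differ in how they treat arbitrary bytes, which may explain why some attempts raise an error while others return gibberish: UTF-8 rejects the byte 0xff outright, whereas Latin-1 assigns a character to every byte value and so never fails, whatever the input. A minimal sketch using only the leading bytes of the output above (the variable names are illustrative):

```python
# First bytes of the response, copied from the output shown in the question.
data = b"\xff\xd8\xff\xe0\x00\x10JFIF"

# UTF-8: 0xff can never start a valid sequence, so decoding raises.
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps each of the 256 byte values to one code point, so it always
# "succeeds"; the result is only meaningful if the bytes really were text.
as_latin1 = data.decode("latin-1")

print(utf8_ok)                               # False
print(as_latin1.encode("latin-1") == data)   # True: a lossless round trip
```

A decode that does not raise is therefore no evidence that the payload is text.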

Here is what some of those printed:

utf-8:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

latin-1:

"ÿØÿàJFIFÿÛC         2! !222222222222222222222222Äµ}"

iso8859_2:

"˙Ř˙ŕJFIF˙ŰC         2!!2222222222"

edit 2: As per this Q&A I cannot post the link, or refer to the webpage.

Even though this question is about decoding the source, it would also be great if you could point towards alternative solutions (i.e. for the other methods I tried; see below).

1) I tried using selenium but the following prevents it from getting the source: "Accessibility support is partially disabled due to compatibility issues with new Firefox features." (The problem seems to be an add-on that is required to log in to the site.)

Selenium code:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get(url)
htmlSource = driver.page_source
driver.quit()
soup = BeautifulSoup(htmlSource, 'lxml')

2) Using urllib didn't work either, and it threw an HTTPError 302 infinite loop. I tried using a cookiejar but to no avail.
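The urllib attempt is not shown, so the following is only a sketch of what a cookie-aware opener usually looks like with the standard library. The helper name `build_opener_with_cookies` and the shortened User-Agent string are assumptions made for illustration. A redirect loop is often a sign that the server sets a cookie during the redirect and expects it back, though that may not be the cause here.

```python
import http.cookiejar
import urllib.request


def build_opener_with_cookies(user_agent):
    """Return (opener, jar): an opener that keeps cookies and sends user_agent."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # addheaders replaces the default "Python-urllib/x.y" User-Agent.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


opener, jar = build_opener_with_cookies("Mozilla/5.0 (X11; Linux i686)")

# With a real URL (not given in the question) the request would be:
# with opener.open(url) as resp:
#     body = resp.read()   # bytes, not text
#     print(resp.status, resp.headers.get("Content-Type"))
```

If the loop persists even with cookies kept across redirects, the site probably depends on login state that only the browser add-on provides.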

It returns: UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 0-1: illegal UTF-16 surrogate – Kostas Mouratidis Apr 02 '17 at 15:34
JFIF stands for JPEG File Interchange Format. You're trying to decode an image as text. – Ilja Everilä Apr 02 '17 at 15:40
@IljaEverilä, thanks, that makes some sense since there is an image in the source code (...), but driver.page_source is supposed to get the html source, which I don't see how it translates to a JPEG file. – Kostas Mouratidis Apr 02 '17 at 15:51
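The JFIF observation can be checked mechanically: many binary formats begin with a fixed signature, and JPEG data starts with the bytes FF D8 FF. Below is a small sketch, assuming a hand-picked and far from exhaustive signature table; `SIGNATURES` and `sniff_format` are illustrative names, not part of requests or any other library.

```python
# Leading "magic" bytes of a few common formats (not an exhaustive list).
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"%PDF-": "pdf",
}


def sniff_format(data):
    """Return a format name if data starts with a known signature, else None."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


body = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"   # start of resp.content in the question
print(sniff_format(body))                        # jpeg
print(sniff_format(b"<!DOCTYPE html><html>"))    # None
```

Running such a check on resp.content before calling .decode() would have flagged the payload as an image rather than a page.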

score 0 · Accepted Answer · edited May 23 '17 at 11:54

As per https://stackoverflow.com/a/41068125/7432972:

resp.text should return Unicode text in your case.

Please post back and let me know whether this works. I have never hit this problem myself, possibly because I have always used request_response.text, except when feeding the response into bs4.

EDIT:

As per @Ilja_Everilä, you got an image as the response instead of the page source you were looking for. I'd check which response code you receive for that request (resp.status_code); there's a chance it isn't 200, meaning the server is returning some other message. If that is the case, changing the User-Agent to something else may fix it, although it seems that the website in question doesn't want requests from the requests module, at the very least.
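The check suggested above can be written as a small gate that runs before any decoding. This is only a sketch: `html_or_reason` is a hypothetical helper, and it assumes a response object exposing `.status_code` and `.headers` the way a requests response does.

```python
def html_or_reason(resp):
    """Return (True, "") if resp looks like an HTML page, else (False, reason).

    resp is anything with .status_code and .headers, e.g. a requests response.
    """
    if resp.status_code != 200:
        return False, f"unexpected status {resp.status_code}"
    # Drop parameters such as "; charset=utf-8" before comparing.
    content_type = resp.headers.get("Content-Type", "").split(";")[0].strip().lower()
    if content_type not in ("text/html", "application/xhtml+xml"):
        return False, f"unexpected content type {content_type or 'missing'}"
    return True, ""
```

With requests this could be used as `resp = requests.get(url, headers={"User-Agent": "..."})` followed by `ok, reason = html_or_reason(resp)`. For a JPEG payload served with status 200, the reason would name the content type rather than the status code.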

Or, even more likely, it has to do with that addon you mentioned that's needed for login. It's possible to add an addon to a selenium.webdriver.FirefoxProfile() through .add_extension('/path/to/addon'). I could not help with any configuration of the addon itself, however.

answered Apr 02 '17 at 15:25

VlB


Oh, I forgot to mention in the question that I did try .text too. Same result: ��JFIF��C 2!!22. What about bs4 though? I do plan to use it later. – Kostas Mouratidis Apr 02 '17 at 15:39
@KostasMouratidis, as for `bs4`, it doesn't see `request_response.text` as valid HTML (unsurprisingly), so you just have to pass `request_response.content` to it instead, that's about all I know. – VlB Apr 02 '17 at 16:05
This doesn't seem to be the problem; the response code is 200, and when I tried to avoid requests I used urllib (the rest of the code is in the original question) by setting these custom headers (which worked for various other sites): `headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"` – Kostas Mouratidis Apr 02 '17 at 16:22
I added on another part just now, that's as far as my knowledge will get you, I'm afraid. – VlB Apr 02 '17 at 16:27
I'm having trouble finding the addon. I did find the path, and there are about 10 which are named something like: `p = "{f759ca51-3a91-4dd1-ae78-9db5eee9ebf0}.xpi"`. The issue isn't going through them 1 by 1, but that I can't even open one. I tried `add_extension(extension=p)` (and also tried using the whole path with `os.path.abspath`) but I get a unicode error (??). Being on windows 10 I then tried using double slashes (\\) for the path but that didn't work either. Any idea as to what might have caused this? – Kostas Mouratidis Apr 02 '17 at 16:52
I've encountered a similar problem and ended up downloading a fresh copy of the needed add-on from addons.mozilla.org. That particular one was pre-configured out of the box, however. I'm not sure if it's the path that's the problem for you, you can check with os.path.exists(':\\path\\to\\addon'). You could also try renaming a copy of that xpi file to something more legible. I think the real problem may be that xpi file is not the whole addon, however and is just part of a configuration. I'll stop replying now as this has gone way past the original question. – VlB Apr 02 '17 at 17:14
I did use the os.path.exists and it was True. I re-installed the addon and got the WHOLE thing to work!! It returned the source beautifully (pun intended). Then, I encountered a problem when sending the keys for login (ConnectionRefusedError [...] target machine actively refused it) but that's another thing. Anyway, that did work, and it is a very interesting thing to know for future reference. Thanks for the help! – Kostas Mouratidis Apr 02 '17 at 17:34
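For future reference, an .xpi file is a ZIP archive, so the standard library can show what one contains and, for WebExtension-style add-ons, which add-on it is. This makes it possible to tell apart a directory of files named only by ID. The sketch below assumes a plain-JSON manifest.json at the archive root; `describe_xpi` is an illustrative name.

```python
import json
import zipfile


def describe_xpi(path):
    """Return (file names, add-on name or None) for the .xpi archive at path."""
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
        addon_name = None
        # WebExtensions keep their metadata in manifest.json at the archive root.
        if "manifest.json" in names:
            try:
                raw = archive.read("manifest.json").decode("utf-8-sig")
                addon_name = json.loads(raw).get("name")
            except ValueError:
                # Not plain JSON (or not UTF-8); leave the name unknown.
                addon_name = None
    return names, addon_name
```

Looping over a profile's extension directory with `pathlib.Path(ext_dir).glob("*.xpi")` (where `ext_dir` is whatever path was found) and printing `describe_xpi(p)[1]` for each file would identify the add-ons. An archive without manifest.json may be a legacy add-on or only part of one.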
