While trying to get the HTML source of an... "academic" site, I am having trouble decoding the response. I am using these requests calls:
import requests
resp = requests.get(url)
print(resp.content)
edit: I did try resp.text
The result is something like this:
"b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xdb\x00C\x00\".
Bytes. Cool. I tried using .decode("format") with various formats mentioned here (iso, latin, utf, cp), but I had no luck.
Here is what some of those printed:
utf-8:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
latin-1:
"ÿØÿàJFIFÿÛC 2! !222222222222222222222222ĵ}"
iso8859_2:
"˙Ř˙ŕJFIF˙ŰC 2!!2222222222"
edit 2: As per this Q&A I cannot post the link, or refer to the webpage.
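One thing worth checking before picking a codec is whether the payload is text at all. The bytes printed above begin with `\xff\xd8\xff` followed by `JFIF`, which is the signature of a JPEG image rather than HTML. The sketch below is only a rough diagnostic: `sniff_payload` and `MAGIC_PREFIXES` are hypothetical names introduced here, and the prefix table covers just a few common formats.

```python
from typing import Optional

# Leading "magic" bytes of a few common binary formats (not exhaustive).
MAGIC_PREFIXES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF8": "gif",
    b"%PDF-": "pdf",
}


def sniff_payload(data: bytes, content_type: Optional[str] = None) -> str:
    """Guess what kind of payload `data` is before trying to decode it."""
    for prefix, kind in MAGIC_PREFIXES.items():
        if data.startswith(prefix):
            return kind
    # Fall back to the declared Content-Type header, if one was given.
    if content_type and content_type.lower().startswith("text/"):
        return "text"
    return "unknown"


# The first bytes of the response shown above.
sample = (
    b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00"
    b"\xff\xdb\x00C\x00"
)
print(sniff_payload(sample))  # prints "jpeg"
```

With a real response this would be called as `sniff_payload(resp.content, resp.headers.get("Content-Type"))`. If it reports an image, no text codec will produce HTML from those bytes.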
Even though this question is about decoding the source, it would also be great if you could point towards alternative solutions (i.e. for the other methods I tried; see below).
1) I tried using selenium, but the following prevents it from getting the source: "Accessibility support is partially disabled due to compatibility issues with new Firefox features." (The problem seems to be an add-on that is required to log in to the site.)
Selenium code:
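from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get(url)
htmlSource = driver.page_source  # grab the rendered HTML before closing the browser
driver.quit()
soup = BeautifulSoup(htmlSource, 'lxml')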
2) Using urllib didn't work either: it raised an HTTPError for a 302 redirect that would lead to an infinite loop. I tried using a cookiejar, but to no avail.
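For context, a cookie-aware urllib opener is usually set up roughly as follows. This is a minimal sketch using only the standard library, not necessarily the exact code that was tried; `build_cookie_opener` is a hypothetical helper name, and the request itself is left commented out because the URL cannot be shared.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return an opener that stores and resends cookies across redirects."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


opener, jar = build_cookie_opener()
# with opener.open(url) as resp:   # `url` is the page that cannot be posted
#     body = resp.read()
# for cookie in jar:               # inspect which cookies the site set
#     print(cookie.name, cookie.domain)
```

A redirect loop often means the site keeps sending the client back to a login step, so a cookie jar alone may not be enough if the session has to be established through that add-on first.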