
I am trying to teach myself some basic web scraping. Using Python's requests module, I was able to grab html for various websites until I tried this:

>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')

Instead of the basic html that is the source for this page, I get:

>>> r.text
'\x1f\ufffd\x08\x00\x00\x00\x00\x00\x00\x03\ufffd]o\u06f8\x12\ufffd\ufffd\ufffd+\ufffd]...

>>> r.content
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\x9d]o\xdb\xb8\x12\x86\xef\xfb+\x88]\x14h...

I have tried many combinations of get/post with every syntax I can guess from the documentation and from SO and other examples. I don't understand what I am seeing above, haven't been able to turn it into anything I can read, and can't figure out how to get what I actually want. My question is, how do I get the html for the above page?

– Rich Thompson

4 Answers


The server in question is giving you a gzipped response. The server is also very broken; it sends the following headers:

$ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F
HTTP/1.1 200 OK
Date: Tue, 06 Jan 2015 17:46:49 GMT
Server: Apache
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3659
Content-Type: text/html

The <!DOCTYPE..> line there is not a valid HTTP header. As such, the remaining headers past Server are ignored. Why the server interjects that is unclear; in all likelihood WRCCWrappers.py is a CGI script that doesn't output headers but does include a double newline after the doctype line, duping the Apache server into inserting additional headers there.

As such, requests also doesn't detect that the data is gzip-encoded. The data is all there, you just have to decode it. Or you could, if it weren't rather incomplete.
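If you did want to decode those raw bytes yourself, a minimal zlib sketch like the one below would do it; 16 + zlib.MAX_WBITS tells zlib to expect gzip framing, and this mirrors the approach Rich Thompson reports in the comments below:

import zlib

import requests

url = ('http://www.wrcc.dri.edu/WRCCWrappers.py'
       '?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
r = requests.get(url)

# 16 + zlib.MAX_WBITS makes zlib accept the gzip header and trailer;
# a decompressobj also yields whatever prefix of a truncated stream is readable
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
html = decompressor.decompress(r.content).decode('utf-8', errors='replace')
print(html[:100])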

The work-around is to tell the server not to bother with compression:

import requests

url = 'http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F'
headers = {'Accept-Encoding': 'identity'}
r = requests.get(url, headers=headers)

and an uncompressed response is returned.

Incidentally, on Python 2 the HTTP header parser is not so strict and manages to declare the doctype a header:

>>> pprint(dict(r.headers))
{'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html xmlns="http': '//www.w3.org/1999/xhtml" lang="en-US">',
 'connection': 'Keep-Alive',
 'content-encoding': 'gzip',
 'content-length': '3659',
 'content-type': 'text/html',
 'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
 'keep-alive': 'timeout=5, max=100',
 'server': 'Apache',
 'vary': 'Accept-Encoding'}

and the content-encoding information survives, so in that case requests decodes the content for you, as expected.

– Martijn Pieters
  • Yep, it is a python 3 problem. Works perfectly every time using python2 – Padraic Cunningham Jan 06 '15 at 17:42
  • @PadraicCunningham: no, it is a server problem. Python 2 just happens to not validate the header properly. It works in Python 2 but you get the `<!DOCTYPE ..>` line as a header. – Martijn Pieters Jan 06 '15 at 17:43
  • @MartijnPieters: It turns out that when I use the work around, the response content is corrupted by the addition of an occasional extra characters starting with the data for 1934. Based on your explanation, I instead decompressed the response content with `zlib.decompress(r.content, 16+zlib.MAX_WBITS)`, which seems to have handled all issues. – Rich Thompson Feb 07 '15 at 22:02
  • FYI, the HTTP headers have now been fixed for this URL. I apologize for the error. – Grant Feb 12 '15 at 19:03
  • @Grant: :-D No need to apologise to me though. – Martijn Pieters Feb 12 '15 at 19:05

The HTTP headers for this URL have now been fixed.

>>> import requests
>>> print requests.__version__
2.5.1
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
>>> r.text[:100]
u'\n<!DOCTYPE html>\n<HTML>\n<HEAD><TITLE>Monthly Average of Precipitation, Station id: 028815</TITLE></H'
>>> r.headers
{'content-length': '3672', 'content-encoding': 'gzip', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Thu, 12 Feb 2015 18:59:37 GMT', 'content-type': 'text/html; charset=utf-8'}
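For Python 3 users, an equivalent check is sketched below; the exact header values will of course depend on the server at request time:

import requests

url = ('http://www.wrcc.dri.edu/WRCCWrappers.py'
       '?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
r = requests.get(url)

# with valid headers, requests transparently decompresses the gzip body
print(requests.__version__)
print(r.headers.get('Content-Encoding'))
print(r.text[:100])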
– Grant

I'd solve that problem in a simpler way. Just import the html library to decode the HTML special characters:

import html

r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')

print(html.unescape(r.text))
– Ângelo Polotto

Here is an example using the BeautifulSoup library. It "makes it easy to scrape information from web pages."

import requests
from bs4 import BeautifulSoup

# request the web page
resp = requests.get("http://example.com")

# get the response text; in this case it is HTML
html = resp.text

# parse the HTML
soup = BeautifulSoup(html, "html.parser")

# print the page text
print(soup.body.get_text().strip())

and the result:

Example Domain
This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
More information...
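
The same pattern can be pointed at the page from the question; soup.title here is based on the <TITLE> tag visible in the response shown in an earlier answer, and assumes the server's headers stay fixed:

import requests
from bs4 import BeautifulSoup

# the URL from the question
url = ('http://www.wrcc.dri.edu/WRCCWrappers.py'
       '?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')

soup = BeautifulSoup(requests.get(url).text, "html.parser")

# html.parser lowercases tag names, so <TITLE> is reachable as soup.title
print(soup.title.get_text())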
– aidanmelen