How to read html from a url in python 3

Question

I looked at previous similar questions and got only more confused.

In python 3.4, I want to read an html page as a string, given the url.

In perl I do this with LWP::Simple, using get().

A matplotlib 1.3.1 example says: import urllib; u1=urllib.urlretrieve(url). python3 can't find urlretrieve.

I tried u1 = urllib.request.urlopen(url), which appears to get an HTTPResponse object, but I can't print it or get a length on it or index it.

u1.body doesn't exist. I can't find a description of the HTTPResponse in python3.

Is there an attribute in the HTTPResponse object which will give me the raw bytes of the html page?

(Irrelevant material in other questions includes urllib2, which doesn't exist in my python, csv parsers, etc.)

Edit:

I found something in a prior question which partially (mostly) does the job:

u2 = urllib.request.urlopen('http://finance.yahoo.com/q?s=aapl&ql=1')

for lines in u2.readlines():
    print (lines)

I say 'partially' because I don't want to read separate lines, but just one big string.

I could just concatenate the lines, but every line printed has a character 'b' prepended to it.

Where does that come from?

Again, I suppose I could delete the first character before concatenating, but that does get to be a kludge.

Here's the description of [`HTTPResponse` objects](https://docs.python.org/3/library/http.client.html#httpresponse-objects) in the Python 3 documentation. — martineau, Aug 24 '15 at 19:49

score 126 · Answer 1 · edited Aug 24 '15 at 19:42

Note that Python 3 does not read the HTML as a string but as a bytes object, so you need to convert it to a string with decode.

import urllib.request

fp = urllib.request.urlopen("http://www.python.org")
mybytes = fp.read()

mystr = mybytes.decode("utf8")
fp.close()

print(mystr)
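One caveat with the snippet above is that it hard-codes UTF-8. Here is a minimal sketch that takes the charset from the response's Content-Type header instead, falling back to UTF-8 when the server does not declare one. The function name `fetch_html`, the `default_charset` parameter, and the choice of `errors="replace"` are assumptions made for illustration, not part of urllib.

```python
import urllib.request


def fetch_html(url, default_charset="utf-8"):
    """Return the body at `url` as a str (hypothetical helper)."""
    # The with-block closes the response even if read() raises.
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes, not str
        # response.headers is an email.message.Message; get_content_charset()
        # returns the charset parameter of Content-Type, or None if absent.
        charset = response.headers.get_content_charset() or default_charset
    # Assumption: replace undecodable bytes rather than raising.
    return raw.decode(charset, errors="replace")
```

Usage would look like `mystr = fetch_html("http://www.python.org")`. Pages that declare their encoding only in a `<meta>` tag are not covered by this sketch.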

answered Jun 17 '15 at 11:18 by davidgh · edited Aug 24 '15 at 19:42 by martineau

The `fp` object has `readlines()` method, at least in Python version **3.6.1**. – RajaRaviVarma Jan 02 '18 at 09:24
not a good idea to assume its UTF-8 encoded. You should try and read the header – CpILL Jun 25 '18 at 07:28
I can't write mystr to a text file. I get this error every time I run the program: `return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode characters in position 369774-369777: character maps to <undefined>` – Detained Developer Sep 16 '18 at 18:36
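
The 'charmap' error in the last comment usually means the file was opened with the platform's default encoding (often a legacy code page on Windows), which cannot represent every character in the page. Below is a minimal sketch that passes the encoding explicitly. The helper name `save_text`, the file name, and the sample string are placeholders, not anything from the answer.

```python
import os
import tempfile


def save_text(text, path, encoding="utf-8"):
    """Write `text` to `path` with an explicit encoding (hypothetical helper)."""
    # Without encoding=..., open() falls back to the locale's preferred
    # encoding, which may not cover every character in the page.
    with open(path, "w", encoding=encoding) as out:
        out.write(text)


# Demo: round-trip a string containing non-ASCII characters
demo_path = os.path.join(tempfile.mkdtemp(), "page.html")
save_text("caf\u00e9 \u2713", demo_path)
with open(demo_path, encoding="utf-8") as f:
    roundtrip = f.read()
```

Reading the file back needs the same `encoding` argument, otherwise the mismatch simply moves to the reading side.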

score 112 · Answer 2 · edited Jun 26 '19 at 17:37

Try the 'requests' module, it's much simpler.

# Install first with: pip install requests

import requests

url = 'https://www.google.com/'
r = requests.get(url)
print(r.text)  # r.text is the page as a decoded str

More info here: http://docs.python-requests.org/en/master/

answered Jan 25 '17 at 22:21 by Aaron T. · edited Jun 26 '19 at 17:37 by James Riordan

`import requests` is Python 2, isn't it? – Fabien Snauwaert Apr 25 '20 at 17:43
what do you mean? import libname is used in py3 too – Sir Von Berker Jul 21 '20 at 19:45
From the website: "Requests officially supports Python 2.7 & 3.6+, and runs great on PyPy." – tenfishsticks Aug 26 '21 at 18:30

score 16 · Answer 3 · answered Jun 11 '14 at 01:59

urllib.request.urlopen(url).read() returns the raw HTML page as a bytes object; call .decode() on the result to get a string.

@user1067305 strange... `request.urlopen()` [returns an `HTTPResponse`](https://docs.python.org/3.4/library/urllib.request.html?highlight=urllib.request#urllib.request.urlopen), and [they do have](https://docs.python.org/3.4/library/http.client.html#http.client.HTTPResponse.read) the `read()` method... – Jun 11 '14 at 02:11
OK! I tried it this way: `u2 = urllib.request.urlopen('http://finance.yahoo.com/q?s=aapl&ql=1'); junk = u2.read(); print(junk)` – user1067305 Jun 11 '14 at 02:33

score 15 · Answer 4 · edited Feb 03 '18 at 22:01

import requests

response = requests.get("http://yahoo.com")
htmltext = response.text
print(htmltext)

This works much like urllib.request.urlopen, except that .text is already a decoded string.

answered Dec 03 '17 at 18:54 by Ramandeep Singh · edited Feb 03 '18 at 22:01 by hoefling

score 13 · Answer 5 · answered Feb 05 '18 at 17:08

Reading an HTML page with urllib is fairly simple. Since you want it as a single string, here is how to do that step by step.

Import urllib.request:

#!/usr/bin/python3.5

import urllib.request

Prepare our request

request = urllib.request.Request('http://www.w3schools.com')

Always use a "try/except" when requesting a web page as things can easily go wrong. urlopen() requests the page.

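try:
    response = urllib.request.urlopen(request)
except urllib.error.URLError as err:
    # URLError also covers HTTPError; urllib.error is loaded by importing urllib.request.
    # Stop here: without a response there is nothing to read below.
    raise SystemExit("something wrong: {}".format(err))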

type() is a handy built-in function that tells us what type a variable is. Here, response is an http.client.HTTPResponse object.

print(type(response))

The read function for our response object will store the html as bytes to our variable. Again type() will verify this.

htmlBytes = response.read()

print(type(htmlBytes))

Now we use the decode function for our bytes variable to get a single string.

htmlStr = htmlBytes.decode("utf8")

print(type(htmlStr))

If you do want to split up this string into separate lines, you can do so with the split() function. In this form we can easily iterate through to print out the entire page or do any other processing.

htmlSplit = htmlStr.split('\n')

print(type(htmlSplit))

for line in htmlSplit:
    print(line)
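
As a small aside on the splitting step: pages may use '\r\n' line endings, in which case split('\n') leaves a trailing '\r' on each line. Here is a sketch using str.splitlines(), which handles both styles. The sample string is made up and stands in for a downloaded page.

```python
# Stand-in for a decoded page; real pages may mix line-ending styles
html_str = "<html>\r\n<body>hi</body>\n</html>"

split_lines = html_str.split("\n")   # keeps the stray '\r' on the first line
clean_lines = html_str.splitlines()  # strips '\r\n' and '\n' alike

print(split_lines)
print(clean_lines)
```

Note that splitlines() also breaks on a few less common separators (form feed, for example), which is usually harmless for HTML.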

Hopefully this gives a somewhat more detailed answer. The Python documentation and tutorials are great; I would use them as a reference, because they will answer most questions you might have.

not a good idea to assume its UTF-8 encoded. You should try and read the header — CpILL, Jun 25 '18 at 07:28
@CpILL good catch. I agree, while utf-8 is widely used, you could potentially run into issues. — Discoveringmypath, Jun 26 '18 at 01:26

score 2 · Answer 6 · edited Aug 03 '20 at 08:36

For python 2

import urllib
some_url = 'https://docs.python.org/2/library/urllib.html'
filehandle = urllib.urlopen(some_url)
print filehandle.read()

answered Jun 11 '14 at 02:05 by agamike · edited Aug 03 '20 at 08:36 by Ali Pardhan

Might specify it is for Python2? As I checked `urllib.urlopen` is not there for Python3. – junhan May 10 '20 at 20:37

bauderr · Answer 7 · 2023-06-29T22:32:49.810

I think the b'' that is prepended to the lines is to signify that it is a bytes string, which is what you asked for. To decode the bytes object:

b'Some html text'.decode()

With no argument, decode() uses UTF-8. However, it is best to decode with the encoding specified in the response headers.
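
To make the b'' point concrete, here is a short sketch. The byte strings are made-up sample data standing in for what readlines() returns on the response; no network access is involved.

```python
# Made-up sample: roughly what readlines() on the response might yield
lines = [b"<html>\n", b"<body>caf\xc3\xa9</body>\n", b"</html>\n"]

# print() shows the repr of a bytes object, hence the leading b
print(lines[0])

# Join the byte chunks first, then decode once to get one big str
page = b"".join(lines).decode("utf-8")
print(type(page).__name__)
```

The b is only part of how Python displays bytes; it is not a character in the data, so there is nothing to strip before concatenating.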

I'm not sure if this works on Python 3.4, but this is how it is done with requests:

import requests 
page = requests.get('https://www.mslscript.com')
html_text = page.text
encoded_html = html_text.encode(page.encoding)
decoded_html = encoded_html.decode(page.encoding)

To make it in one line of text is simple:

# Remove all the CR and LF chars; replace() already handles every occurrence
decoded_html = decoded_html.replace('\r', '').replace('\n', '')

# Collapse runs of spaces into a single space; the loop is needed
# here because one pass only shortens a long run, it does not finish it
while '  ' in decoded_html:
    decoded_html = decoded_html.replace('  ', ' ')

# Remove tabs '\t', maybe not.
decoded_html = decoded_html.replace('\t', '')

You can also use requests-async, a library that is compatible with Python 3.6 and is said to work well in conjunction with Trio. The latest version of requests, by contrast, requires Python 3.7 or later, so you may want to upgrade to at least Python 3.8 if possible.
