Inherent way to save web page source

Question

I have read a lot of answers about web scraping that recommend BeautifulSoup, Scrapy, etc.

Is there a way to do the equivalent of saving a page's source from a web browser?

That is, is there a way in Python to point it at a website and get it to save the page's source to a text file with just the standard Python modules?

Here is where I got to:

import urllib

f = open('webpage.txt', 'w')
html = urllib.urlopen("http://www.somewebpage.com")

#somehow save the web page source

f.close()

Not much, I know, but I'm looking for code that actually pulls the source of the page so I can write it out. I gather that urlopen just makes a connection.

Perhaps there is a readlines() equivalent for reading lines of a web page?

Welcome to Stack Overflow! We encourage you to [research your questions](http://stackoverflow.com/questions/how-to-ask). If you've [tried something already](http://whathaveyoutried.com/), please add it to the question - if not, research and attempt your question first, and then come back. — , Nov 11 '12 at 14:49
Thanks! Am still very new to the site so sorry if I approached this the wrong way. Will add some code of where I got to :) — Fusilli Jerry, Nov 11 '12 at 14:53

score 31 · Accepted Answer · answered Nov 11 '12 at 14:52

You may try urllib2:

import urllib2

page = urllib2.urlopen('http://stackoverflow.com')

page_content = page.read()

with open('page_content.html', 'w') as fid:
    fid.write(page_content)

btel

To avoid encoding problems use `with open('page_content.html', 'wb') as fid:` – Steve Barnes May 15 '16 at 06:24

score 2 · Answer 2 · answered Feb 11 '18 at 18:22

Updated code for Python 3 (where urllib2 was removed and urlopen now lives in urllib.request):

from urllib.request import urlopen
html = urlopen("http://www.google.com/")
with open('page_content.html', 'w') as fid:
    fid.write(html)

SoHei

Error: TypeError: write() argument must be str, not HTTPResponse – Moondra Sep 05 '18 at 04:44
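
The error arises because urlopen returns a response object rather than the page text; its read() method returns bytes, which have to be decoded before they can be written in text mode. A minimal sketch of that route, assuming the server declares its charset in the Content-Type header and falling back to UTF-8 when it does not (the fallback, and the helper names `fetch_page_text` and `save_page_text`, are assumptions for illustration and not part of the answer above):

```python
from urllib.request import urlopen


def fetch_page_text(url, default_encoding="utf-8"):
    """Return the body at `url` decoded to str (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes, not str
        # Charset declared in the Content-Type header, or None if absent.
        charset = response.headers.get_content_charset()
    # Assumption: fall back to UTF-8 and replace undecodable bytes.
    return raw.decode(charset or default_encoding, errors="replace")


def save_page_text(url, path):
    """Write the decoded page to `path` as UTF-8 text (hypothetical helper)."""
    text = fetch_page_text(url)
    with open(path, "w", encoding="utf-8") as fid:
        fid.write(text)
```

Something like `save_page_text("http://www.google.com/", "page_content.html")` would then write a text file. Note that the saved file is always UTF-8 here, which may differ from the encoding the page was served in.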

score 1 · Answer 3 · answered Dec 24 '18 at 07:19

SoHei's answer will not work: it is missing the call to html.read(), and the file must be opened with 'wb' instead of just 'w'. The 'b' means the data is written in binary mode, which is needed because .read() returns a sequence of bytes. The fully working code is:

from urllib.request import urlopen
html = urlopen("http://www.google.com/")
page_content = html.read()
with open('page_content.html', 'wb') as fid:
    fid.write(page_content)
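
A possible variation on the code above, offered as a sketch rather than a definitive version: it wraps the steps in a function, closes the response with a `with` block, and copies the body in chunks instead of reading it all at once. Since the response object is file-like, it also has a readlines() method, which is the closest match to what the question asks for. The function names `save_page_source` and `read_page_lines` are hypothetical.

```python
import shutil
from urllib.request import urlopen


def save_page_source(url, path):
    """Copy the raw response body at `url` into `path` (hypothetical helper)."""
    # Both context managers close their resources even if an error occurs.
    with urlopen(url) as response, open(path, "wb") as fid:
        # Copies in chunks, so the whole page is never held in memory at once.
        shutil.copyfileobj(response, fid)


def read_page_lines(url):
    """Return the page as a list of lines, each a bytes object."""
    with urlopen(url) as response:
        # The response is file-like, so readlines() works as it does on a file.
        return response.readlines()
```

Calling `save_page_source("http://www.google.com/", "page_content.html")` should write the same bytes as the snippet above. The lines returned by `read_page_lines` are bytes and would need decoding before being treated as text.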

DrManhattan

This does not retrieve the same content as when I navigate to my target page and "view page source" - not sure if that is a problem unique to the page I am viewing (requires logon and page source has scripts and embedded json that don't show when read and saved as above). – James Jan 11 '20 at 05:12
