16

I have read a lot of answers about web scraping that recommend BeautifulSoup, Scrapy, etc.

Is there a way to do the equivalent of saving a page's source from a web browser?

That is, is there a way in Python to point it at a website and get it to save the page's source to a text file with just the standard Python modules?

Here is where I got to:

import urllib

f = open('webpage.txt', 'w')
html = urllib.urlopen("http://www.somewebpage.com")

#somehow save the web page source

f.close()

Not much, I know, but I'm looking for code that actually pulls the source of the page so I can write it to the file. I gather that urlopen just makes a connection.

Perhaps there is a readlines() equivalent for reading lines of a web page?

martineau
Fusilli Jerry
  • 2
    Welcome to Stack Overflow! We encourage you to [research your questions](http://stackoverflow.com/questions/how-to-ask). If you've [tried something already](http://whathaveyoutried.com/), please add it to the question - if not, research and attempt your question first, and then come back. –  Nov 11 '12 at 14:49
  • 1
    Thanks! Am still very new to the site so sorry if I approached this the wrong way. Will add some code of where I got to :) – Fusilli Jerry Nov 11 '12 at 14:53

3 Answers

31

You can use urllib2 (Python 2):

import urllib2

page = urllib2.urlopen('http://stackoverflow.com')

page_content = page.read()

with open('page_content.html', 'w') as fid:
    fid.write(page_content)
btel
2

Updated code for Python 3 (where urllib2 no longer exists; urlopen now lives in urllib.request):

from urllib.request import urlopen
html = urlopen("http://www.google.com/")
with open('page_content.html', 'w') as fid:
    fid.write(html)
SoHei
1

SoHei's answer will not work: it never calls html.read(), and the file must be opened in 'wb' mode rather than 'w'. The 'b' selects binary mode, which is needed because .read() returns a bytes object, not a string. The fully working code is:

from urllib.request import urlopen
html = urlopen("http://www.google.com/")
page_content = html.read()
with open('page_content.html', 'wb') as fid:
    fid.write(page_content)
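If you want a text file rather than raw bytes, as the question asks, one option is to decode the response before writing it. The following is only a sketch under a few assumptions: the function name save_page_source and its default_encoding parameter are made up for illustration, and falling back to UTF-8 when the server declares no charset is a guess that will not suit every page.

```python
from urllib.request import urlopen


def save_page_source(url, path, default_encoding="utf-8"):
    """Download `url` and write its decoded source to `path` as text.

    Hypothetical helper, not part of the standard library.
    """
    with urlopen(url) as response:
        raw = response.read()  # the body arrives as bytes
        # Use the charset from the Content-Type header when one is declared;
        # otherwise fall back to the assumed default.
        encoding = response.headers.get_content_charset() or default_encoding

    text = raw.decode(encoding, errors="replace")

    # newline="" keeps the page's own line endings unchanged on every platform.
    with open(path, "w", encoding=encoding, errors="replace", newline="") as fid:
        fid.write(text)
    return text
```

Calling save_page_source("http://www.google.com/", "webpage.txt") should then leave the decoded source in webpage.txt. Like the code above, it saves only what the server sends in response to a plain request, so anything the browser adds later by running scripts will not be there.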
  • This does not retrieve the same content as when I navigate to my target page and "view page source" - not sure if that is a problem unique to the page I am viewing (requires logon and page source has scripts and embedded json that don't show when read and saved as above). – James Jan 11 '20 at 05:12