Download HTML page and its contents

Question

Does Python have any way of downloading an entire HTML page and its contents (images, CSS) to a local folder, given a URL, and of updating the local HTML file so that it picks up that content locally?

Possible duplicate of [How can I download full webpage by a Python program?](https://stackoverflow.com/questions/31205497/how-can-i-download-full-webpage-by-a-python-program) — imbr, Aug 14 '19 at 12:30

score 44 · Accepted Answer · edited May 23 '17 at 12:10

You can use the urllib module to download individual URLs but this will just return the data. It will not parse the HTML and automatically download things like CSS files and images.

If you want to download the "whole" page you will need to parse the HTML and find the other things you need to download. You could use something like Beautiful Soup to parse the HTML you retrieve.

This question has some sample code doing exactly that.
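As a rough illustration of the "parse the HTML and find the other things you need to download" step, here is a minimal sketch. It uses the standard library's `html.parser` in place of Beautiful Soup so that it runs without extra packages; the names `RESOURCE_ATTRS`, `ResourceCollector` and `find_resources` are made up for this example, and the tag list is an assumed, minimal one.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag -> attribute that usually holds the resource URL (assumed, minimal set).
RESOURCE_ATTRS = {"img": "src", "script": "src", "link": "href"}


class ResourceCollector(HTMLParser):
    """Collects absolute URLs of images, scripts and linked files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page URL.
                self.resources.append(urljoin(self.base_url, value))


def find_resources(html, base_url):
    """Return the resource URLs referenced by an HTML string, in document order."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    return collector.resources
```

Each returned URL can then be downloaded separately. Plain `<a href>` links are ignored on purpose, since they point to other pages rather than to parts of this one.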

score 13 · Answer 2 · answered Dec 01 '09 at 11:00

You can use urllib:

import urllib.request

opener = urllib.request.FancyURLopener({})
url = "http://stackoverflow.com/"
f = opener.open(url)
content = f.read()

answered Dec 01 '09 at 11:00

Lucas

That only appears to download a page taking into account HTTP response codes; it doesn't actually download the page resources unless I'm missing something. – bdeniker Jun 30 '14 at 08:13

Unfortunately, this is now deprecated: `DeprecationWarning: FancyURLopener style of invoking requests is deprecated. Use newer urlopen functions/methods` – rien333 Aug 28 '19 at 21:43

FancyURLopener is deprecated for me, too. Google took me to this answer: https://stackoverflow.com/a/54261548/569302 – Jesus is Lord Dec 22 '22 at 01:51
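
For reference, the "newer urlopen functions" mentioned in the warning can be used roughly like this. It is only a sketch: the helper name `fetch` and the 10-second timeout are assumptions for this example. Like the answer above, it downloads a single URL and none of the resources the page links to.

```python
import urllib.request


def fetch(url, timeout=10):
    """Return the raw bytes of a single URL (linked resources are not fetched)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


# Example (needs network access):
# content = fetch("http://stackoverflow.com/")
```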

score 11 · Answer 3 · answered Dec 01 '09 at 11:59

What you're looking for is a mirroring tool. If you want one in Python, PyPI lists spider.py, but I have no experience with it. Others might be better, but I don't know. I use 'wget', which supports getting the CSS and the images. This probably does what you want (quoting from the manual):

Retrieve only one HTML page, but make sure that all the elements needed for the page to be displayed, such as inline images and external style sheets, are also downloaded. Also make sure the downloaded page references the downloaded links.

wget -p --convert-links http://www.server.com/dir/page.html
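
If the download has to be started from Python, one option is to call wget as a subprocess. The sketch below assumes wget is installed and on the PATH; the function names are made up for this example, and the `-P` option (wget's directory prefix) is an addition to the quoted command so that the files land in a chosen folder.

```python
import shutil
import subprocess


def build_wget_command(url, dest_dir="."):
    """Build the argument list for the wget call quoted above, plus -P."""
    return ["wget", "-p", "--convert-links", "-P", dest_dir, url]


def mirror_page(url, dest_dir="."):
    """Run wget; raises if wget is missing or exits with an error status."""
    if shutil.which("wget") is None:
        raise FileNotFoundError("wget not found on PATH")
    subprocess.run(build_wget_command(url, dest_dir), check=True)


# Example (needs wget and network access):
# mirror_page("http://www.server.com/dir/page.html", "mirror")
```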

imbr · Answer 4 · 2022-09-23T23:53:46.563

Function `savePage` below:

Saves the .html file and downloads the JavaScript, CSS and image files referenced by the script, link and img tags (the tags_inner dict keys).
Resource files are saved in a folder with the suffix _files.
Any exceptions are printed to sys.stderr.

Uses Python 3 with Requests, BeautifulSoup and the standard library.

The function savePage receives a url and a pagepath that says where to save the page.

You can expand/adapt it to suit your needs

import os, sys, re
import requests
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def savePage(url, pagepath='page'):
    def savenRename(soup, pagefolder, session, url, tag, inner):
        os.makedirs(pagefolder, exist_ok=True) # create only once
        for res in soup.find_all(tag):   # images, css, etc..
            if res.has_attr(inner): # the attribute holding the file URL MUST exist
                try:
                    fileurl = urljoin(url, res.get(inner))
                    # get name and extension from the URL path only (ignores any ?query part)
                    filename, ext = os.path.splitext(os.path.basename(urlparse(fileurl).path))
                    filename = re.sub(r'\W+', '', filename) + ext # clean special chars from name
                    filepath = os.path.join(pagefolder, filename)
                    # rename html ref so can move html and folder of files anywhere
                    # ('/' rather than os.path.join, because this is a URL and not an OS path)
                    res[inner] = os.path.basename(pagefolder) + '/' + filename
                    if not os.path.isfile(filepath): # was not downloaded
                        # download before opening the file, so a failed request leaves no empty file
                        filebin = session.get(fileurl)
                        with open(filepath, 'wb') as file:
                            file.write(filebin.content)
                except Exception as exc:
                    print(exc, file=sys.stderr)
    session = requests.Session()
    #... whatever other requests config you need here
    response = session.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    path, _ = os.path.splitext(pagepath)
    pagefolder = path+'_files' # page contents folder
    tags_inner = {'img': 'src', 'link': 'href', 'script': 'src'} # tags and the attributes to grab
    for tag, inner in tags_inner.items(): # saves resource files and rename refs
        savenRename(soup, pagefolder, session, url, tag, inner)
    with open(path+'.html', 'wb') as file: # saves modified html doc
        file.write(soup.prettify('utf-8'))

Example saving google.com as google.html, with its contents in the google_files folder (both in the current folder):

savePage('https://www.google.com', 'google')
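
To see how a resource reference can end up as a local file name, the naming step can be tried on its own, without any download. This is a sketch under assumptions: the helper `local_name` is made up for this example, and it takes the name from the URL path only, which may differ from what a given version of `savePage` does.

```python
import os
import re
from urllib.parse import urljoin, urlparse


def local_name(page_url, ref):
    """Map a resource reference to (absolute URL, sanitized local file name)."""
    fileurl = urljoin(page_url, ref)
    # Use only the URL path, so "?v=3" style query strings stay out of the name.
    name, ext = os.path.splitext(os.path.basename(urlparse(fileurl).path))
    # Drop every non-word character from the name, keep the extension as is.
    return fileurl, re.sub(r"\W+", "", name) + ext


print(local_name("https://www.google.com/", "/static/app.min.js?v=3"))
# → ('https://www.google.com/static/app.min.js?v=3', 'appmin.js')
```

Because special characters are stripped, two different resources can map to the same local name (for example `a-b.css` and `ab.css`), in which case the second one is not downloaded.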

With this example I get "bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?" — Wassadamo, Mar 25 '22 at 03:25
@Wassadamo you needed to install lxml package. But to simplify I just changed to "html.parser" in `soup = BeautifulSoup(response.text, "html.parser")` — imbr, Mar 25 '22 at 13:26
