25

Currently I have a script that can only download the HTML of a given page.

Now I want to download all the files of the web page, including HTML, CSS, JS and image files (the same as we get with Ctrl-S on any website).

My current code is:

import urllib  # Python 2; in Python 3 this function lives in urllib.request
url = "https://en.wikipedia.org/wiki/Python_%28programming_language%29"
urllib.urlretrieve(url, "t3.html")

I have looked at many similar questions, but they all only download the HTML.

Vikas Yadav
Rahul Satal
  • So you want to go through the links in the HTML and download the content they point to? Note that a Wikipedia page contains links to other pages; do you want to do that recursively? – jonrsharpe Jul 03 '15 at 11:21
  • Yes, I want to download all the links in the main link along with their CSS and JS files. – Rahul Satal Jul 03 '15 at 11:29
  • Or just tell me how to download only one given page's CSS and JS files. – Rahul Satal Jul 03 '15 at 11:41
  • **Decompose the problem**. Break it down into small steps, and research each one separately. You know how to get the first page, so now work out how to extract the links you want from the HTML (hint: this is called parsing). – jonrsharpe Jul 03 '15 at 12:11
  • @jonrsharpe I only know how to download the HTML of the first web page, but its CSS files are not downloading. – Rahul Satal Jul 03 '15 at 12:25
  • If you've written some code, it's not working and you can't figure out why, post a [minimal example](http://stackoverflow.com/help/mcve) and a precise description of the problem with it. – jonrsharpe Jul 03 '15 at 12:26

4 Answers

23

The following implementation lets you collect the sub-HTML pages of the main website. It can be developed further to fetch the other files you need (a sketch of that step follows the Python 3 version below). I added the depth parameter so you can set how many levels of sub-pages you want to crawl.

import urllib2
from BeautifulSoup import *
from urlparse import urljoin


def crawl(pages, depth=1):
    indexed_url = []  # a list for the main and sub-HTML pages of the main website
    for i in range(depth):
        for page in pages:
            if page not in indexed_url:
                indexed_url.append(page)
                try:
                    c = urllib2.urlopen(page)
                except:
                    print "Could not open %s" % page
                    continue
                soup = BeautifulSoup(c.read())
                links = soup('a')  # finding all the sub-links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]  # drop any fragment
                        if url[0:4] == 'http':
                            indexed_url.append(url)
        pages = indexed_url
    return indexed_url


pagelist = ["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=2)
print urls

Python 3 version, 2019. May this save somebody some time:

#!/usr/bin/env python3


import urllib.request as urllib2
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def crawl(pages, depth=1):
    indexed_url = []  # a list for the main and sub-HTML pages of the main website
    for i in range(depth):
        for page in pages:
            if page not in indexed_url:
                indexed_url.append(page)
                try:
                    c = urllib2.urlopen(page)
                except Exception:
                    print("Could not open %s" % page)
                    continue
                soup = BeautifulSoup(c.read(), "html.parser")
                links = soup('a')  # finding all the sub-links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]  # drop any fragment
                        if url[0:4] == 'http':
                            indexed_url.append(url)
        pages = indexed_url
    return indexed_url


pagelist = ["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=1)
print(urls)
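
A rough sketch of the "develop it further" step, for one page at a time: download the CSS, JS and image files it references via link, script and img tags. This assumes bs4 is installed; save_assets and the assets output folder are just illustrative names, and some sites reject the default urllib User-Agent, so treat it as a starting point:

import os
import urllib.request
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def save_assets(page_url, out_dir="assets"):
    """Download the files referenced by <link href>, <script src> and <img src>."""
    os.makedirs(out_dir, exist_ok=True)
    html = urllib.request.urlopen(page_url).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("link", href=True) + soup.find_all(["script", "img"], src=True)
    for tag in tags:
        asset_url = urljoin(page_url, tag.get("href") or tag.get("src"))
        # note: different assets sharing a basename will overwrite each other
        filename = os.path.basename(urlparse(asset_url).path) or "index"
        try:
            urllib.request.urlretrieve(asset_url, os.path.join(out_dir, filename))
        except OSError as exc:  # HTTP errors are subclasses of OSError
            print("Could not fetch %s: %s" % (asset_url, exc))

save_assets("https://en.wikipedia.org/wiki/Python_%28programming_language%29")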
jaromrax
Sam Al-Ghammari
20

You can do that easily with the simple Python library pywebcopy.

For the current version, 5.0.1:


from pywebcopy import save_webpage

url = 'http://some-site.com/some-page.html'
download_folder = '/path/to/downloads/'    

kwargs = {'bypass_robots': True, 'project_name': 'recognisable-name'}

save_webpage(url, download_folder, **kwargs)

You will have the HTML, CSS and JS all in your download_folder, working just like the original site.
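
pywebcopy also provides a save_website function for crawling a whole site rather than a single page (roughly what the comment below asks about). The call here just mirrors save_webpage and is an assumption about the 5.x API, so check the pywebcopy docs for the exact signature:

# assumption: save_website is available alongside save_webpage in your version
from pywebcopy import save_website

kwargs = {'bypass_robots': True, 'project_name': 'recognisable-name'}
save_website('http://some-site.com/', '/path/to/downloads/', **kwargs)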

rajatomar788
  • This downloads just this specific page `some-page.html`? Can it crawl based on a base URL, e.g. get just the pages under `http://some-site.com/projects/specific-sub-folder/`? – Ulf Gjerdingen Jan 13 '21 at 11:49
6

Using Python 3+, Requests, and otherwise only standard libraries.

The function savePage receives a requests.Response and the pagefilename to save it under. It:

  • Saves pagefilename.html in the current folder.
  • Downloads the JavaScript, CSS and images based on the tags script, link and img, and saves them in a folder pagefilename_files.
  • Any exceptions are printed to sys.stderr; returns a BeautifulSoup object.
  • The requests session must be a global variable unless someone writes cleaner code here for us (a sketch that avoids the global follows the example below).

You can adapt it to your needs.


import os, sys
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def soupfindAllnSave(pagefolder, url, soup, tag2find='img', inner='src'):
    if not os.path.exists(pagefolder):  # create only once
        os.mkdir(pagefolder)
    for res in soup.findAll(tag2find):   # images, css, etc.
        try:
            filename = os.path.basename(res[inner])  # res[inner] may or may not exist
            fileurl = urljoin(url, res.get(inner))
            filepath = os.path.join(pagefolder, filename)
            # rewrite the tag so the saved HTML points at the local copy
            res[inner] = os.path.join(os.path.basename(pagefolder), filename)
            if not os.path.isfile(filepath):  # not downloaded yet
                with open(filepath, 'wb') as file:
                    filebin = session.get(fileurl)
                    file.write(filebin.content)
        except Exception as exc:
            print(exc, file=sys.stderr)
    return soup

def savePage(response, pagefilename='page'):
    url = response.url
    soup = BeautifulSoup(response.text, 'html.parser')
    pagefolder = pagefilename + '_files'  # folder for the page contents
    soup = soupfindAllnSave(pagefolder, url, soup, 'img', inner='src')
    soup = soupfindAllnSave(pagefolder, url, soup, 'link', inner='href')
    soup = soupfindAllnSave(pagefolder, url, soup, 'script', inner='src')
    with open(pagefilename + '.html', 'w') as file:
        file.write(soup.prettify())
    return soup

Example: saving the Google page and its contents (google_files folder):

session = requests.Session()
#... whatever requests config you need here
response = session.get('https://www.google.com')
savePage(response, 'google')
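
One possible way to avoid the global session mentioned above is to pass it to the helpers explicitly. A rough sketch of that variation (save_asset and save_page are illustrative names, not part of the code above):

import os, sys
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def save_asset(session, pagefolder, pageurl, tag, inner):
    os.makedirs(pagefolder, exist_ok=True)
    fileurl = urljoin(pageurl, tag.get(inner))
    filename = os.path.basename(fileurl.split('?')[0]) or 'index'
    filepath = os.path.join(pagefolder, filename)
    tag[inner] = os.path.join(os.path.basename(pagefolder), filename)  # point at local copy
    if not os.path.isfile(filepath):
        with open(filepath, 'wb') as file:
            file.write(session.get(fileurl).content)

def save_page(session, response, pagefilename='page'):
    soup = BeautifulSoup(response.text, 'html.parser')
    pagefolder = pagefilename + '_files'
    for tag2find, inner in (('img', 'src'), ('link', 'href'), ('script', 'src')):
        for res in soup.find_all(tag2find, **{inner: True}):  # only tags having that attribute
            try:
                save_asset(session, pagefolder, response.url, res, inner)
            except Exception as exc:
                print(exc, file=sys.stderr)
    with open(pagefilename + '.html', 'w', encoding='utf-8') as file:
        file.write(soup.prettify())
    return soup

session = requests.Session()
save_page(session, session.get('https://www.google.com'), 'google')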
imbr
2

Try the Python library Scrapy. You can program Scrapy to recursively scan a website, downloading its pages, scanning them and following links:

An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.
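
A minimal spider with a recent Scrapy version could look roughly like this (a sketch; the spider name and the file handling are illustrative). Run it with scrapy runspider spider.py; response.follow and .getall() need a reasonably recent Scrapy, older releases would use urljoin and .extract() instead:

import scrapy

class PageSpider(scrapy.Spider):
    name = 'pages'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Python_%28programming_language%29']

    def parse(self, response):
        # save the raw HTML of the page that was just downloaded
        filename = response.url.rstrip('/').split('/')[-1] or 'index'
        with open(filename + '.html', 'wb') as f:
            f.write(response.body)
        # follow links to other pages (off-site requests are filtered by allowed_domains)
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)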

Wtower
  • Thanks @barny, but can you please tell me whether it can be implemented using the BeautifulSoup library or HTTP requests, because I have some knowledge of those. – Rahul Satal Jul 03 '15 at 11:39
  • Good heavens, my Answer has been Revised. Read the python, err, Python library Scrapy documentation, for example the FAQ says as its first answer: Scrapy provides a built-in mechanism for extracting data (called selectors) but you can easily use BeautifulSoup (or lxml) instead. http://doc.scrapy.org/en/1.0/faq.html – DisappointedByUnaccountableMod Jul 03 '15 at 12:08