25

Currently I have a script that can only download the HTML of a given page.

Now I want to download all the files of the web page, including HTML, CSS, JS and image files (the same as we get with Ctrl-S on any website).

My current code is:

import urllib  # Python 2; in Python 3 this function lives in urllib.request
url = "https://en.wikipedia.org/wiki/Python_%28programming_language%29"
urllib.urlretrieve(url, "t3.html")

I have looked at many similar questions, but they all only download the HTML.

Vikas Yadav
Rahul Satal
  • So you want to go through the links in the HTML and download the content they point to? Note that a Wikipedia page contains links to other pages; do you want to do that recursively? – jonrsharpe Jul 03 '15 at 11:21
  • Yes, I want to download all the links in the main link along with their CSS and JS files. – Rahul Satal Jul 03 '15 at 11:29
  • Or just tell me how to download only one given page's CSS and JS files. – Rahul Satal Jul 03 '15 at 11:41
  • **Decompose the problem**. Break it down into small steps, and research each one separately. You know how to get the first page, so now work out how to extract the links you want from the HTML (hint: this is called parsing). – jonrsharpe Jul 03 '15 at 12:11
  • @jonrsharpe I only know how to download the HTML of the first web page, but its CSS files are not downloading. – Rahul Satal Jul 03 '15 at 12:25
  • If you've written some code, it's not working and you can't figure out why, post a [minimal example](http://stackoverflow.com/help/mcve) and a precise description of the problem with it. – jonrsharpe Jul 03 '15 at 12:26

4 Answers

23

The following implementation lets you collect the sub-HTML pages of the main website. It can be developed further to fetch the other files you need (a sketch of that step follows the Python 3 version below). I added the depth parameter so you can set how many levels of sub-pages you want to crawl.

import urllib2
from BeautifulSoup import *
from urlparse import urljoin


def crawl(pages, depth=1):
    indexed_url = []  # a list for the main and sub-HTML pages of the main website
    for i in range(depth):
        for page in pages:
            if page not in indexed_url:
                indexed_url.append(page)
                try:
                    c = urllib2.urlopen(page)
                except:
                    print "Could not open %s" % page
                    continue
                soup = BeautifulSoup(c.read())
                links = soup('a')  # finding all the sub-links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]  # drop any fragment
                        if url[0:4] == 'http':
                            indexed_url.append(url)
        pages = indexed_url
    return indexed_url


pagelist = ["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=2)
print urls

Python 3 version, 2019. May this save somebody some time:

#!/usr/bin/env python3


import urllib.request as urllib2
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def crawl(pages, depth=1):
    indexed_url = []  # a list for the main and sub-HTML pages of the main website
    for i in range(depth):
        for page in pages:
            if page not in indexed_url:
                indexed_url.append(page)
                try:
                    c = urllib2.urlopen(page)
                except Exception:
                    print("Could not open %s" % page)
                    continue
                soup = BeautifulSoup(c.read(), "html.parser")
                links = soup('a')  # finding all the sub-links
                for link in links:
                    if 'href' in dict(link.attrs):
                        url = urljoin(page, link['href'])
                        if url.find("'") != -1:
                            continue
                        url = url.split('#')[0]  # drop any fragment
                        if url[0:4] == 'http':
                            indexed_url.append(url)
        pages = indexed_url
    return indexed_url


pagelist = ["https://en.wikipedia.org/wiki/Python_%28programming_language%29"]
urls = crawl(pagelist, depth=1)
print(urls)
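
A rough sketch of the "develop it further" step, for one page at a time: download the CSS, JS and image files it references via link, script and img tags. This assumes bs4 is installed; save_assets and the assets output folder are just illustrative names, and some sites reject the default urllib User-Agent, so treat it as a starting point:

import os
import urllib.request
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def save_assets(page_url, out_dir="assets"):
    """Download the files referenced by <link href>, <script src> and <img src>."""
    os.makedirs(out_dir, exist_ok=True)
    html = urllib.request.urlopen(page_url).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("link", href=True) + soup.find_all(["script", "img"], src=True)
    for tag in tags:
        asset_url = urljoin(page_url, tag.get("href") or tag.get("src"))
        # note: different assets sharing a basename will overwrite each other
        filename = os.path.basename(urlparse(asset_url).path) or "index"
        try:
            urllib.request.urlretrieve(asset_url, os.path.join(out_dir, filename))
        except OSError as exc:  # HTTP errors are subclasses of OSError
            print("Could not fetch %s: %s" % (asset_url, exc))

save_assets("https://en.wikipedia.org/wiki/Python_%28programming_language%29")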
jaromrax
Sam Al-Ghammari
20

You can do that easily with the simple Python library pywebcopy.

For the current version, 5.0.1:


from pywebcopy import save_webpage

url = 'http://some-site.com/some-page.html'
download_folder = '/path/to/downloads/'    

kwargs = {'bypass_robots': True, 'project_name': 'recognisable-name'}

save_webpage(url, download_folder, **kwargs)

You will have the HTML, CSS and JS all in your download_folder, working just like the original site.
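
pywebcopy also provides a save_website function for crawling a whole site rather than a single page (roughly what the comment below asks about). The call here just mirrors save_webpage and is an assumption about the 5.x API, so check the pywebcopy docs for the exact signature:

# assumption: save_website is available alongside save_webpage in your version
from pywebcopy import save_website

kwargs = {'bypass_robots': True, 'project_name': 'recognisable-name'}
save_website('http://some-site.com/', '/path/to/downloads/', **kwargs)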

rajatomar788
  • This downloads just this specific page `some-page.html`? Can it crawl based on a base URL, e.g. get just the pages under `http://some-site.com/projects/specific-sub-folder/`? – Ulf Gjerdingen Jan 13 '21 at 11:49
6

Using Python 3+, Requests, and otherwise only standard libraries.

The function savePage receives a requests.Response and the pagefilename to save it under. It:

  • Saves pagefilename.html in the current folder.
  • Downloads the JavaScript, CSS and images based on the tags script, link and img, and saves them in a folder pagefilename_files.
  • Any exceptions are printed to sys.stderr; returns a BeautifulSoup object.
  • The requests session must be a global variable unless someone writes cleaner code here for us (a sketch that avoids the global follows the example below).

You can adapt it to your needs.


import os, sys
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def soupfindAllnSave(pagefolder, url, soup, tag2find='img', inner='src'):
    if not os.path.exists(pagefolder):  # create only once
        os.mkdir(pagefolder)
    for res in soup.findAll(tag2find):   # images, css, etc.
        try:
            filename = os.path.basename(res[inner])  # res[inner] may or may not exist
            fileurl = urljoin(url, res.get(inner))
            filepath = os.path.join(pagefolder, filename)
            # rewrite the tag so the saved HTML points at the local copy
            res[inner] = os.path.join(os.path.basename(pagefolder), filename)
            if not os.path.isfile(filepath):  # not downloaded yet
                with open(filepath, 'wb') as file:
                    filebin = session.get(fileurl)
                    file.write(filebin.content)
        except Exception as exc:
            print(exc, file=sys.stderr)
    return soup

def savePage(response, pagefilename='page'):
    url = response.url
    soup = BeautifulSoup(response.text, 'html.parser')
    pagefolder = pagefilename + '_files'  # folder for the page contents
    soup = soupfindAllnSave(pagefolder, url, soup, 'img', inner='src')
    soup = soupfindAllnSave(pagefolder, url, soup, 'link', inner='href')
    soup = soupfindAllnSave(pagefolder, url, soup, 'script', inner='src')
    with open(pagefilename + '.html', 'w') as file:
        file.write(soup.prettify())
    return soup

Example: saving the Google page and its contents (google_files folder):

session = requests.Session()
#... whatever requests config you need here
response = session.get('https://www.google.com')
savePage(response, 'google')
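
One possible way to avoid the global session mentioned above is to pass it to the helpers explicitly. A rough sketch of that variation (save_asset and save_page are illustrative names, not part of the code above):

import os, sys
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def save_asset(session, pagefolder, pageurl, tag, inner):
    os.makedirs(pagefolder, exist_ok=True)
    fileurl = urljoin(pageurl, tag.get(inner))
    filename = os.path.basename(fileurl.split('?')[0]) or 'index'
    filepath = os.path.join(pagefolder, filename)
    tag[inner] = os.path.join(os.path.basename(pagefolder), filename)  # point at local copy
    if not os.path.isfile(filepath):
        with open(filepath, 'wb') as file:
            file.write(session.get(fileurl).content)

def save_page(session, response, pagefilename='page'):
    soup = BeautifulSoup(response.text, 'html.parser')
    pagefolder = pagefilename + '_files'
    for tag2find, inner in (('img', 'src'), ('link', 'href'), ('script', 'src')):
        for res in soup.find_all(tag2find, **{inner: True}):  # only tags having that attribute
            try:
                save_asset(session, pagefolder, response.url, res, inner)
            except Exception as exc:
                print(exc, file=sys.stderr)
    with open(pagefilename + '.html', 'w', encoding='utf-8') as file:
        file.write(soup.prettify())
    return soup

session = requests.Session()
save_page(session, session.get('https://www.google.com'), 'google')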
imbr
2

Try the Python library Scrapy. You can program Scrapy to recursively scan a website, downloading its pages, scanning them and following links:

An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.
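
A minimal spider with a recent Scrapy version could look roughly like this (a sketch; the spider name and the file handling are illustrative). Run it with scrapy runspider spider.py; response.follow and .getall() need a reasonably recent Scrapy, older releases would use urljoin and .extract() instead:

import scrapy

class PageSpider(scrapy.Spider):
    name = 'pages'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Python_%28programming_language%29']

    def parse(self, response):
        # save the raw HTML of the page that was just downloaded
        filename = response.url.rstrip('/').split('/')[-1] or 'index'
        with open(filename + '.html', 'wb') as f:
            f.write(response.body)
        # follow links to other pages (off-site requests are filtered by allowed_domains)
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)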

Wtower
  • Thanks @barny, but can you please tell me whether it can be implemented using the BeautifulSoup library or HTTP requests, because I have some knowledge of those. – Rahul Satal Jul 03 '15 at 11:39
  • Good heavens, my Answer has been Revised. Read the python, err, Python library Scrapy documentation, for example the FAQ says as its first answer: Scrapy provides a built-in mechanism for extracting data (called selectors) but you can easily use BeautifulSoup (or lxml) instead. http://doc.scrapy.org/en/1.0/faq.html – DisappointedByUnaccountableMod Jul 03 '15 at 12:08