
The same question was asked 2.5 years ago in Downloading a web page and all of its resource files in Python, but it didn't lead to an answer, and the 'please see related topic' isn't really asking the same thing.

I want to download everything on a page so that it can be viewed just from the downloaded files.

The command

wget --page-requisites --domains=DOMAIN --no-parent --html-extension --convert-links --restrict-file-names=windows

does exactly what I need. However, we want to tie it in with other tooling that must be portable, so it needs to be in Python.

I've been looking at Beautiful Soup, Scrapy, and various spiders posted around the place, but they all seem to deal with getting data/links in clever but specific ways. Using them to do what I want looks like it would take a lot of work just to find all of the resources, when I'm sure there must be an easier way.

thanks very much

Conrad
  • `import urllib; urllib.urlretrieve('http://www.somesite.com/file.whatever', 'filename to be downloaded as')` – CR0SS0V3R Feb 10 '12 at 00:32
  • 1
    so I know that I can download a singular file in that manner, but I'll need to use a crawler and set many conditions to find all of the files that I want (everything to be able to view a section of a website offline). There must be something around that downloads website and requisites in Python? – Conrad Feb 10 '12 at 01:47
  • you could use a parsing function within a for-loop to search for links within the downloaded file (or read from wherever) – CR0SS0V3R Feb 11 '12 at 22:59
  • this is what we're doing. To be honest I thought it was going to be harder than it was to find the page dependencies (images, CSS), but the links to them are there in the pages to be found and added to a set. – Conrad Feb 13 '12 at 06:07
  • [`scrapy`](https://scrapy.org/) seems to have evolved to be very flexible. Have you tried to get it to do what you want more recently? Can you clarify what you want that it can't do? – nealmcb Nov 22 '19 at 17:17
  • And what about [pywebcopy: Python library to mirror webpage and websites.](https://github.com/rajatomar788/pywebcopy)? – nealmcb Nov 22 '19 at 20:34

2 Answers


You should be using an appropriate tool for the job at hand.

If you want to spider a site and save the pages to disk, Python probably isn't the best choice for that. Open source projects get features when someone needs them, and because wget does this job so well, nobody has bothered to write a Python library to replace it.

Considering wget runs on pretty much any platform that has a Python interpreter, is there a reason you can't use wget?
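
For example, if shelling out is acceptable, here is a minimal sketch of driving the wget command from the question out of Python via the standard library's subprocess module (this assumes wget is installed and on the PATH; the URL and domain are placeholders):

import subprocess

def mirror_page(url, domain):
    # Same flags as the wget command in the question; wget does the actual crawling.
    cmd = [
        "wget",
        "--page-requisites",
        "--domains=" + domain,
        "--no-parent",
        "--html-extension",
        "--convert-links",
        "--restrict-file-names=windows",
        url,
    ]
    return subprocess.call(cmd)  # returns wget's exit code

if __name__ == "__main__":
    # placeholder URL and domain, just for illustration
    mirror_page("http://example.com/docs/index.html", "example.com")

That keeps the crawling logic in wget while the surrounding tooling stays in Python.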

ironchefpython
  • you make a good point that nobody would write one for Python; the only reason I haven't pursued the wget route is that I was asked to do it in Python... I'm guessing they want to reduce dependencies. We have now pretty much written the tool in Python for our narrow use. Will post it up here if allowed – Conrad Feb 13 '12 at 06:09

My colleague wrote this code, much of it pieced together from other sources, I believe. It might have some quirks specific to our system, but it should help anyone wanting to do the same.

"""
    Downloads all links from a specified location and saves to machine.
    Downloaded links will only be of a lower level then links specified.
    To use: python downloader.py link
"""
import sys,re,os,urllib2,urllib,urlparse
tocrawl = set([sys.argv[1]])
# linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?')
linkregex = re.compile('href=[\'|"](.*?)[\'"].*?')
linksrc = re.compile('src=[\'|"](.*?)[\'"].*?')
def main():
    link_list = []##create a list of all found links so there are no duplicates
    restrict = sys.argv[1]##used to restrict found links to only have lower level
    link_list.append(restrict)
    parent_folder = restrict.rfind('/', 0, len(restrict)-1)
    ##a.com/b/c/d/ make /d/ as parent folder
    while 1:
        try:
            crawling = tocrawl.pop()
            #print crawling
        except KeyError:
            break
        url = urlparse.urlparse(crawling)##splits url into sections
        try:
            response = urllib2.urlopen(crawling)##try to open the url
        except:
            continue
        msg = response.read()##save source of url
        links = linkregex.findall(msg)##search for all href in source
        links = links + linksrc.findall(msg)##search for all src in source
        for link in (links.pop(0) for _ in xrange(len(links))):
            if link.startswith('/'):
                ##if /xxx a.com/b/c/ -> a.com/b/c/xxx
                link = 'http://' + url[1] + link
            elif ~link.find('#'):
                continue
            elif link.startswith('../'):
                if link.find('../../'):##only use links that are max 1 level above reference
                    ##if ../xxx.html a.com/b/c/d.html -> a.com/b/xxx.html
                    parent_pos = url[2].rfind('/')
                    parent_pos = url[2].rfind('/', 0, parent_pos-2) + 1
                    parent_url = url[2][:parent_pos]
                    new_link = link.find('/')+1
                    link = link[new_link:]
                    link = 'http://' + url[1] + parent_url + link
                else:
                    continue
            elif not link.startswith('http'):
                if url[2].find('.html'):
                    ##if xxx.html a.com/b/c/d.html -> a.com/b/c/xxx.html
                    a = url[2].rfind('/')+1
                    parent = url[2][:a]
                    link = 'http://' + url[1] + parent + link
                else:
                    ##if xxx.html a.com/b/c/ -> a.com/b/c/xxx.html
                    link = 'http://' + url[1] + url[2] + link
            if link not in link_list:
                link_list.append(link)##add link to list of already found links
                if (~link.find(restrict)):
                ##only grab links which are below input site
                    print link ##print downloaded link
                    tocrawl.add(link)##add link to pending view links
                    file_name = link[parent_folder+1:]##folder structure for files to be saved
                    filename = file_name.rfind('/')
                    folder = file_name[:filename]##creates folder names
                    folder = os.path.abspath(folder)##creates folder path
                    if not os.path.exists(folder):
                        os.makedirs(folder)##make folder if it does not exist
                    try:
                        urllib.urlretrieve(link, file_name)##download the link
                    except:
                        print "could not download %s"%link
                else:
                    continue
if __name__ == "__main__":
    main()

thanks for the replies

Conrad
  • I am a novice in programming. Can you please tell me how to use this code? I also want to download everything linked to a webpage and open it locally, and I am also asked to do it in Python. –  Jan 29 '13 at 04:44
  • Where do I put my link and where is my page getting saved? –  Jan 29 '13 at 04:44
  • 4
    ouch.. use an html parser – Corey Goldberg Sep 29 '14 at 11:57
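
Following up on that last comment: for what it's worth, here is a minimal sketch of doing the link extraction with the standard library's HTMLParser instead of regular expressions (Python 2, to match the code above; the URL below is just a placeholder, and urlparse.urljoin replaces the manual relative-path handling):

from HTMLParser import HTMLParser
import urlparse

class LinkCollector(HTMLParser):
    """Collects href/src attribute values, resolved against a base URL."""
    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ('href', 'src') and value:
                # urljoin handles relative paths, '../', and absolute URLs
                self.links.add(urlparse.urljoin(self.base_url, value))

# usage, with a placeholder URL and the page source already fetched into html_source:
# parser = LinkCollector('http://example.com/docs/index.html')
# parser.feed(html_source)
# for link in parser.links: print link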