
I have a specific requirement: I need to download more than 100 files from a website, but the file names are not consecutive (e.g., www.blah.com/1, www.blah.com/3, www.blah.com/4, www.blah.com/10). I want to go through the sequence and skip a link if it is unavailable. What is the most efficient way to do that in Python?

sethu
  • This reminds me very strongly of a post I have seen before, in the last few weeks. Did you ask this before, or have you posted it from another Stack Overflow question? – halfer Jul 11 '15 at 09:48

2 Answers


Assuming you have a normal TCP/IP connection, the following code may be of use:

import urllib2

def download(link, filename):
    try:
        response = urllib2.urlopen(link)
    except urllib2.HTTPError:
        # the link is unavailable (e.g. 404), so skip it
        return
    with open(filename, "wb") as f:
        f.write(response.read())

# loop over the sequence, skipping unavailable links
uri = "http://example.com/"
for x in xrange(1, 101):
    download(uri + str(x), str(x))

Just an example; modify it as you please.
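
If a link can also fail at the connection level (host unreachable, DNS error, timeout), catching urllib2.URLError, of which HTTPError is a subclass, would skip those cases as well. A minimal sketch along the same lines (the timeout value is an arbitrary choice, not part of the original answer):

import urllib2

def download(link, filename):
    try:
        # timeout value is arbitrary; pick one that suits the site
        response = urllib2.urlopen(link, timeout=10)
    except urllib2.URLError:
        # URLError covers HTTP errors (404, 500, ...) as well as
        # connection-level failures, so the loop simply moves on
        return
    with open(filename, "wb") as f:
        f.write(response.read())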

Ronnie

You may want to look into the Scrapy web-scraping framework.

By default, Scrapy would skip pages if the response status code is 4xx or 5xx.

You should start with something along these lines:

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    allowed_domains = ["domain.com"]

    def start_requests(self):
        for page in xrange(100):
            yield scrapy.Request("http://domain.com/page/{page}".format(page=page))

    def parse(self, response):
        # TODO: parse page
        pass
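
If the goal is simply to save each page to disk rather than parse it, the parse callback could write the response body to a file. A rough sketch, assuming the trailing URL segment is an acceptable filename (that naming scheme is my assumption, not part of the original answer):

import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"
    allowed_domains = ["domain.com"]

    def start_requests(self):
        for page in xrange(100):
            yield scrapy.Request("http://domain.com/page/{page}".format(page=page))

    def parse(self, response):
        # 4xx/5xx responses never reach this callback, so unavailable
        # links in the sequence are skipped automatically
        filename = response.url.rstrip("/").rsplit("/", 1)[-1]
        with open(filename, "wb") as f:
            f.write(response.body)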

Also, make sure you are a good web-scraping citizen.

alecxe