
I have a specific requirement: I need to download more than 100 files from a website, but the file names are not consecutive (e.g., www.blah.com/1, www.blah.com/3, www.blah.com/4, www.blah.com/10). I want to go through the sequence and skip a link if it is unavailable. What is the most efficient way to do that in Python?

sethu
  • This reminds me very strongly of a post I have seen before, in the last few weeks. Did you ask this before, or have you posted it from another Stack Overflow question? – halfer Jul 11 '15 at 09:48

2 Answers


Assuming you have a normal TCP/IP connection, the following code may be of use:

import urllib2

def download(link, filename):
    try:
        response = urllib2.urlopen(link)
    except urllib2.HTTPError:
        # the link is unavailable (e.g. 404), so skip it
        return
    with open(filename, "wb") as f:
        f.write(response.read())

# loop over the sequence, skipping unavailable links
uri = "http://example.com/"
for x in xrange(1, 101):
    download(uri + str(x), str(x))

Just an example; modify it as you please.
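
If a link can also fail at the connection level (host unreachable, DNS error, timeout), catching urllib2.URLError, of which HTTPError is a subclass, would skip those cases as well. A minimal sketch along the same lines (the timeout value is an arbitrary choice, not part of the original answer):

import urllib2

def download(link, filename):
    try:
        # timeout value is arbitrary; pick one that suits the site
        response = urllib2.urlopen(link, timeout=10)
    except urllib2.URLError:
        # URLError covers HTTP errors (404, 500, ...) as well as
        # connection-level failures, so the loop simply moves on
        return
    with open(filename, "wb") as f:
        f.write(response.read())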

Ronnie

You may want to look into the Scrapy web-scraping framework.

By default, Scrapy would skip pages if the response status code is 4xx or 5xx.

You should start with something along these lines:

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    allowed_domains = ["domain.com"]

    def start_requests(self):
        for page in xrange(100):
            yield scrapy.Request("http://domain.com/page/{page}".format(page=page))

    def parse(self, response):
        # TODO: parse page
        pass
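
If the goal is simply to save each page to disk rather than parse it, the parse callback could write the response body to a file. A rough sketch, assuming the trailing URL segment is an acceptable filename (that naming scheme is my assumption, not part of the original answer):

import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"
    allowed_domains = ["domain.com"]

    def start_requests(self):
        for page in xrange(100):
            yield scrapy.Request("http://domain.com/page/{page}".format(page=page))

    def parse(self, response):
        # 4xx/5xx responses never reach this callback, so unavailable
        # links in the sequence are skipped automatically
        filename = response.url.rstrip("/").rsplit("/", 1)[-1]
        with open(filename, "wb") as f:
            f.write(response.body)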

Also, make sure you are a good web-scraping citizen.

alecxe