I need to download more than 100 files from a website. I want to download all of them, but the file names are not consecutive (e.g. www.blah.com/1, www.blah.com/3, www.blah.com/4, www.blah.com/10). I want to go through the sequence and skip a link if it is unavailable. What is the most efficient way to do that in Python?
This reminds me very strongly of a post I have seen before, in the last few weeks. Did you ask this before, or have you posted it from another Stack Overflow question? – halfer Jul 11 '15 at 09:48
2 Answers
Assuming you have a normal TCP/IP connection, the following code may be of use:

import urllib2

def download(link, filename):
    try:
        req = urllib2.urlopen(link)
    except urllib2.HTTPError:
        # the link is unavailable (e.g. 404), skip it
        return
    data = req.read()
    if data:
        with open(filename, "wb") as f:
            f.write(data)

# here goes your loop over the sequence
uri = "http://example.com/"
for x in xrange(1, 101):
    download(uri + str(x), str(x))

This is just an example; modify it as you please.
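
If you are on Python 3, urllib2 no longer exists; here is a minimal equivalent sketch using urllib.request (the base URL and file names are placeholders, as above):

import urllib.request
import urllib.error

def download(link, filename):
    try:
        response = urllib.request.urlopen(link)
    except urllib.error.HTTPError:
        # the link is unavailable (e.g. 404), skip it
        return
    data = response.read()
    if data:
        with open(filename, "wb") as f:
            f.write(data)

uri = "http://example.com/"
for x in range(1, 101):
    download(uri + str(x), str(x))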

Ronnie
You may want to look into the Scrapy web-scraping framework. By default, Scrapy skips pages whose response status code is 4xx or 5xx.

You should start with something along these lines:
import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    allowed_domains = ["domain.com"]

    def start_requests(self):
        for page in xrange(100):
            yield scrapy.Request("http://domain.com/page/{page}".format(page=page))

    def parse(self, response):
        # TODO: parse page
        pass
Also, make sure you are a good web-scraping citizen.
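
To try the spider above without creating a full Scrapy project, you can save it to a single file and run it with the runspider command (the file name here is just an assumption):

scrapy runspider my_spider.py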