I want to scan a website and download the images on it. The image URLs look like a.example.com/2vZBkE.jpg, so I need a bot that tries every six-character name from a.example.com/aaaaaa.jpg through a.example.com/AAAAAA.jpg up to a.example.com/999999.jpg and, whenever an image exists at that address, saves the URL or downloads the image.
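To make "if there is an image" concrete, this is what I mean for a single candidate URL. It is only an illustration using the requests library (not part of my Scrapy attempt), and check_image is just a name I made up:

import requests

def check_image(url):
    # Return True if the server answers with an image at this URL.
    try:
        # A HEAD request avoids downloading the whole body just to test existence.
        resp = requests.head(url, timeout=5, allow_redirects=True)
    except requests.RequestException:
        return False
    return (resp.status_code == 200
            and resp.headers.get('Content-Type', '').startswith('image/'))

url = 'http://a.example.com/2vZBkE.jpg'
if check_image(url):
    print('found:', url)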
I tried using Python and Scrapy, but I am very new to both. This is as far as I could get:
import scrapy
from scrapy.contrib.spiders import Rule, CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor  # needed by the rule below
from example.items import ExampleItem


class exampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://a.example.com/2vZBkE']

    # rules = [Rule(LinkExtractor(allow=['/.*']), 'parse_example')]
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'\/%s\/.*',)), callback='parse_example'),
    )

    def parse_example(self, response):
        # Build one item per page: the image title plus the image URL.
        image = ExampleItem()
        image['title'] = response.xpath(
            "//h5[@id='image-title']/text()").extract()
        rel = response.xpath("//img/@src").extract()
        image['image_urls'] = ['http:' + rel[0]]
        return image
I think I need to change this line:

rules = (
    Rule(SgmlLinkExtractor(allow=(r'\/%s\/.*',)), callback='parse_example'),
)

so that %s is somehow limited to six characters and Scrapy tries every possible combination. Any ideas?
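What I have in mind is something like the sketch below: generate every six-character candidate with itertools.product and hand each URL to Scrapy from start_requests, instead of relying on a link-extractor rule. I am not sure this is the right approach (the full space is 62**6, roughly 5.7e10 URLs, so it clearly cannot be run in full), the spider name is a placeholder, and the sketch is written against a current Scrapy release rather than the scrapy.contrib API I used above:

import itertools
import string

import scrapy


class BruteForceImageSpider(scrapy.Spider):
    # Sketch only: enumerate aaaaaa .. 999999 and request every candidate URL.
    name = 'bruteforce_images'
    allowed_domains = ['example.com']

    # Lowercase letters, uppercase letters and digits, as in aaaaaa -> AAAAAA -> 999999.
    charset = string.ascii_lowercase + string.ascii_uppercase + string.digits

    def start_requests(self):
        # WARNING: 62**6 is about 5.7e10 combinations; this loop only
        # illustrates the idea and is far too many requests to run in full.
        for combo in itertools.product(self.charset, repeat=6):
            name = ''.join(combo)
            url = 'http://a.example.com/%s.jpg' % name
            yield scrapy.Request(url, callback=self.parse_image)

    def parse_image(self, response):
        # Scrapy drops non-2xx responses by default, so reaching this callback
        # means the image exists; record the URL (or save response.body).
        self.logger.info('found image: %s', response.url)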