I am new to Scrapy and haven't found any help so far.
I want to make a small scraper that scrapes all the URLs on a page, hits them one by one, and if a URL returns a downloadable file of any extension, downloads it and saves it to a specified location. Here's the code that I have written:
items.py
import scrapy


class CrawlerItem(scrapy.Item):
    file = scrapy.Field()
    file_url = scrapy.Field()
spider.py
from scrapy import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http import Request

from crawler.items import CrawlerItem

DOMAIN = 'example.com'
URL = 'http://%s' % DOMAIN


class MycrawlerSpider(CrawlSpider):
    name = "mycrawler"
    allowed_domains = [DOMAIN]
    start_urls = [URL]

    def parse_dir_contents(self, response):
        print(response.headers)
        item = CrawlerItem()
        item['file_url'] = response.url
        return item

    def parse(self, response):
        hxs = Selector(response)
        for url in hxs.xpath('//a/@href').extract():
            if url.startswith('http://') or url.startswith('https://'):
                yield Request(url, callback=self.parse_dir_contents)
        for url in hxs.xpath('//iframe/@src').extract():
            yield Request(url, callback=self.parse_dir_contents)
The issue I am facing is that parse_dir_contents does not show the headers I need, so it's hard to tell whether the response is a downloadable file or just page content.
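For context, this is roughly the check I was hoping to do inside parse_dir_contents once the headers are available. It's only a sketch, not working code: the Content-Type / Content-Disposition header names are my guess at what to look for, and the /tmp/downloads/ save path is a placeholder (it assumes that directory already exists).

    def parse_dir_contents(self, response):
        # My guess: a downloadable file would announce itself via a
        # Content-Disposition header or a non-HTML Content-Type
        content_type = response.headers.get('Content-Type', b'').decode('utf-8', 'ignore')
        disposition = response.headers.get('Content-Disposition', b'').decode('utf-8', 'ignore')

        if 'attachment' in disposition or not content_type.startswith('text/html'):
            # Treat it as a file and save it to a hard-coded location for now
            filename = response.url.split('/')[-1] or 'downloaded_file'
            with open('/tmp/downloads/' + filename, 'wb') as f:
                f.write(response.body)

        item = CrawlerItem()
        item['file_url'] = response.url
        return item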
BTW, I am using Scrapy 1.1.0 and Python 3.4.
Any help would be really appreciated!!