I am building a Scrapy spider that follows every link it can find and sends each URL to a pipeline. This is my code at the moment:
from scrapy import Spider
from scrapy.http import Request
from scrapy.http import TextResponse
from scrapy.selector import Selector
from scrapyTest.items import TestItem
import urlparse

class TestSpider(Spider):
    name = 'TestSpider'
    allowed_domains = ['pyzaist.com']
    start_urls = ['http://pyzaist.com/drone']

    def parse(self, response):
        item = TestItem()
        item["url"] = response.url
        yield item
        links = response.xpath("//a/@href").extract()
        for link in links:
            yield Request(urlparse.urljoin(response.url, link))
This does the job, but it raises an error whenever the response is a plain Response rather than a TextResponse or HtmlResponse, because there is no Response.xpath(). I tried to guard against this with a type check:
if type(response) is TextResponse:
    links = response.xpath("//a/@href").extract()
    ...
But to no avail: with that check in place, execution never enters the if block, even for ordinary HTML pages. I am new to Python, so it might be a language thing. I appreciate any help.
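One possible explanation, sketched here without Scrapy itself: in Scrapy, HtmlResponse is a subclass of TextResponse, and `type(x) is TextResponse` is only true for exact instances of TextResponse, never for instances of its subclasses. The classes below are hypothetical stand-ins that mirror only that inheritance chain, and the two helper functions are illustrative names, not part of any library. Whether the pages in this crawl actually arrive as HtmlResponse is an assumption.

```python
# Hypothetical stand-ins for Scrapy's response classes. Only the
# inheritance chain (Response -> TextResponse -> HtmlResponse) matters here.
class Response(object):
    pass


class TextResponse(Response):
    pass


class HtmlResponse(TextResponse):
    pass


def exact_type_check(response):
    # True only when response is an instance of TextResponse itself,
    # not of one of its subclasses.
    return type(response) is TextResponse


def subclass_aware_check(response):
    # True for TextResponse and for any subclass, such as HtmlResponse.
    return isinstance(response, TextResponse)


html_page = HtmlResponse()
binary_file = Response()

print(exact_type_check(html_page))        # False: the exact type is HtmlResponse
print(subclass_aware_check(html_page))    # True: HtmlResponse inherits from TextResponse
print(subclass_aware_check(binary_file))  # False: a plain Response is not a TextResponse
```

If that is what is happening, replacing the `type(response) is TextResponse` test in parse() with `isinstance(response, TextResponse)` should let HTML pages through while still skipping plain Response objects.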