Distinguishing between HTML and non-HTML pages in Scrapy

Question

I am building a Spider in Scrapy that follows every link it can find and sends each URL to a pipeline. At the moment, this is my code:

from scrapy import Spider
from scrapy.http import Request
from scrapy.http import TextResponse
from scrapy.selector import Selector
from scrapyTest.items import TestItem
import urlparse


class TestSpider(Spider):
    name = 'TestSpider'
    allowed_domains = ['pyzaist.com']
    start_urls = ['http://pyzaist.com/drone']

    def parse(self, response):
        # Send the URL of every visited page to the pipeline.
        item = TestItem()
        item["url"] = response.url
        yield item

        # Follow every link found on the page.
        links = response.xpath("//a/@href").extract()
        for link in links:
            yield Request(urlparse.urljoin(response.url, link))

This does the job, but throws an error whenever the response is just a Response, not a TextResponse or HtmlResponse. This is because there is no Response.xpath(). I tried to test for this by doing:

if type(response) is TextResponse:
    links = response.xpath("//a/@href").extract()
    ...

But to no avail. When I do that, it never enters the if statement. I am new to Python, so it might be a language thing. I appreciate any help.

score 1 · Accepted Answer · edited May 23 '17 at 11:51


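Never mind, I found the answer. type() reports only an object's exact class and says nothing about inheritance, so the check type(response) is TextResponse fails for an HtmlResponse, which is a subclass of TextResponse. I was looking for isinstance(). This code works: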

if isinstance(response, TextResponse):
    links = response.xpath("//a/@href").extract()
    ...

Reference: https://stackoverflow.com/a/2225066/1455074 (see near the bottom of that answer)
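To make the difference concrete, here is a minimal sketch that runs without Scrapy installed. The three classes are simplified stand-ins that only mirror the Response / TextResponse / HtmlResponse inheritance chain described above; they are not the real scrapy.http classes, and can_extract_links is a hypothetical helper name used for illustration.

```python
# Stand-in classes mirroring the inheritance chain discussed above.
# Assumption: simplified placeholders, not the real scrapy.http classes.
class Response(object):
    """Base response: raw body only, no selector methods."""


class TextResponse(Response):
    """Text-aware response: the level at which xpath() is available."""

    def xpath(self, query):
        return []  # placeholder; the real method returns selector results


class HtmlResponse(TextResponse):
    """What an HTML page typically arrives as."""


def can_extract_links(response):
    # Hypothetical helper: isinstance() accepts TextResponse
    # and any subclass of it, such as HtmlResponse.
    return isinstance(response, TextResponse)


html_page = HtmlResponse()
binary_file = Response()

# type() compares the exact class only, so a subclass does not match.
exact_match = type(html_page) is TextResponse
# isinstance() follows inheritance, so the HTML page passes the check.
subclass_match = can_extract_links(html_page)
# A plain Response is not a TextResponse, so it is skipped.
binary_match = can_extract_links(binary_file)
```

With these stand-ins, exact_match is False while subclass_match is True, which is why the type() check in the question never entered the if statement. In a real spider the same isinstance() guard would sit at the top of the link-extraction step in parse().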

answered Jun 17 '15 at 20:50 by tschwab (edited May 23 '17 at 11:51 by Community)
