Scrapy Spider:relative link and absolute link

Question

There is an example in Scrapy Documentation Release 1.0.3,in 7th row, urljoin method is used when the links is relative.when the links is absolute,what should i do?

example code:

import scrapy


class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }

You can only get the `href` value and store in `full_url` : `full_url = href.extract()` — general03, Oct 02 '15 at 11:56

score 2 · Answer 1 · answered Oct 02 '15 at 13:33

You don't need to worry about, urljoin() handles both cases properly:

In [1]: response.urljoin("http://stackoverflow.com/questions/426258/checking-a-checkbox-with-jquery")
Out[1]: 'http://stackoverflow.com/questions/426258/checking-a-checkbox-with-jquery'

In [2]: response.urljoin("/questions/426258/checking-a-checkbox-with-jquery")
Out[2]: 'http://stackoverflow.com/questions/426258/checking-a-checkbox-with-jquery'

Scrapy Spider:relative link and absolute link

1 Answers1