
This is how my spider is set up:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request

class CustomSpider(CrawlSpider):
    name = 'custombot'
    allowed_domains = ['www.domain.com']
    start_urls = ['http://www.domain.com/some-url']
    rules = (
        Rule(SgmlLinkExtractor(allow=r'.*?something/'), callback='do_stuff', follow=True),
    )

    def start_requests(self):
        # start_requests must return an iterable of Requests
        return [Request('http://www.domain.com/some-other-url', callback=self.do_something_else)]

It goes to /some-other-url but not /some-url. What is wrong here? The URLs specified in start_urls are the ones that need links extracted and sent through the rules filter, whereas the ones in start_requests are sent directly to the item parser, so they don't need to pass through the rules filter.

Crypto

1 Answer


From the documentation for start_requests: overriding start_requests means that the URLs defined in start_urls are ignored.

This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, the make_requests_from_url() is used instead to create the Requests.
[...]
If you want to change the Requests used to start scraping a domain, this is the method to override.

If you want to just scrape from /some-url, then remove start_requests. If you want to scrape from both, then add /some-url to the start_urls list.

Talvalin
  • If I add /some-url to start_requests, then how do I make it pass through the rules in rules() to set up the right callbacks? This is the scenario: the /some-url page contains links to other pages which need to be extracted. The /some-other-url contains JSON responses, so there are no links to extract and it can be sent directly to the item parser. – Crypto Feb 11 '14 at 12:32
  • I hope this approach is correct but I used init_request instead of start_requests and that seems to do the trick. – Crypto Feb 11 '14 at 12:38
  •
    Possibly a bit late, but if you still need help then edit the question to post all of your spider code and a valid URL. :) – Talvalin Mar 01 '14 at 22:29