
This is how my spider is set up:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request

class CustomSpider(CrawlSpider):
    name = 'custombot'
    allowed_domains = ['www.domain.com']
    start_urls = ['http://www.domain.com/some-url']
    rules = (
        Rule(SgmlLinkExtractor(allow=r'.*?something/'), callback='do_stuff', follow=True),
    )

    def start_requests(self):
        # start_requests must return an iterable of Requests
        return [Request('http://www.domain.com/some-other-url', callback=self.do_something_else)]

It goes to /some-other-url but not /some-url. What is wrong here? The URLs specified in start_urls are the ones that need links extracted and sent through the rules filter, whereas the ones in start_requests are sent directly to the item parser, so they don't need to pass through the rules filter.

Crypto

1 Answer


From the documentation for start_requests: overriding start_requests means that the URLs defined in start_urls are ignored.

This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, the make_requests_from_url() is used instead to create the Requests.
[...]
If you want to change the Requests used to start scraping a domain, this is the method to override.

If you want to just scrape from /some-url, then remove start_requests. If you want to scrape from both, then add /some-url to the start_urls list.

Talvalin
  • If I add /some-url to start_requests, then how do I make it pass through the rules in rules() to set up the right callbacks? This is the scenario: the /some-url page contains links to other pages which need to be extracted. The /some-other-url contains JSON responses, so there are no links to extract and it can be sent directly to the item parser. – Crypto Feb 11 '14 at 12:32
  • I hope this approach is correct but I used init_request instead of start_requests and that seems to do the trick. – Crypto Feb 11 '14 at 12:38
  •
    Possibly a bit late, but if you still need help then edit the question to post all of your spider code and a valid URL. :) – Talvalin Mar 01 '14 at 22:29