
My question is very similar to this one but not exactly the same. I am attempting to crawl webpage A, which contains items I want to scrape and a list of links to webpages B1, B2, B3, ... Each B page contains items I want to scrape and a link to another page, C1, C2, C3. Each C page also contains items I want to scrape.

Here's what my spider looks like for now:

import scrapy
from scrapy.spiders import CrawlSpider
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest

from ..items import XiaomiItem


class XiaomiSpider(CrawlSpider):
    name = 'xiaomi'
    allowed_domains = ['mi.com']
    start_urls = ['http://app.mi.com']

    # parse page A: follow every link to a B page
    def parse(self, response):
        le = LinkExtractor()
        for link in le.extract_links(response):
            yield SplashRequest(
                link.url,
                callback=self.parse_2,
                endpoint='render.html',
                args={'wait': 0.5},
            )

    # parse B pages: scrape the items, then follow the links to C pages
    def parse_2(self, response):
        for sel in response.xpath('//ul[@class = "applist"]/li/h5/a'):
            item = XiaomiItem()
            item['name'] = sel.xpath('text()').extract()
            yield item

        le = LinkExtractor()
        for link in le.extract_links(response):
            yield SplashRequest(
                link.url,
                callback=self.parse_dir_contents,
                endpoint='render.html',
                args={'wait': 0.5},
            )

    # parse C pages: scrape the items and save them
    def parse_dir_contents(self, response):
        for sel in response.xpath('//ul[@class = "applist"]/li/h5/a'):
            item = XiaomiItem()
            item['name'] = sel.xpath('text()').extract()
            yield item

The issue is that the crawler never reaches the C pages. I suspect the problem is in the parse_2 function, but I don't know how to fix it. Can anyone show me what the problem is?
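For reference, here is a toy sketch of the A → B → C traversal I am trying to achieve, independent of Scrapy/Splash. The page names and item names are made up; it just shows that every page should yield its own items and then recurse into its outgoing links:

```python
# Hypothetical link graph standing in for my site: A links to B pages,
# B pages link to C pages, and B/C pages carry items to scrape.
PAGES = {
    "A":  {"items": [],       "links": ["B1", "B2"]},
    "B1": {"items": ["app1"], "links": ["C1"]},
    "B2": {"items": ["app2"], "links": ["C2"]},
    "C1": {"items": ["app3"], "links": []},
    "C2": {"items": ["app4"], "links": []},
}

def crawl(url):
    """Yield every item reachable from `url`, depth-first."""
    page = PAGES[url]
    for item in page["items"]:
        yield item                 # scrape items on this page
    for link in page["links"]:
        yield from crawl(link)     # then follow its links

print(list(crawl("A")))  # -> ['app1', 'app3', 'app2', 'app4']
```

In my real spider the two levels use separate callbacks (parse_2 and parse_dir_contents) instead of one recursive function, but the intended traversal is the same.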

  • This should be what you want, [`Request callback`](http://doc.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions) – Harrison Jun 19 '16 at 00:45
  • Thanks @Harrison. I looked into it, but it seems that method does not involve scrapy-splash, is that correct? Maybe I should have stated clearly that, in my problem, the B pages have JavaScript-generated URLs to the C pages. So I guess I have to use SplashRequest, right? – user3768495 Jun 19 '16 at 05:29
  • If the javascript generated content is loaded using ajax, then you can use Browser Inspect to check the ajax request url. More details [here](http://stackoverflow.com/questions/8550114/can-scrapy-be-used-to-scrape-dynamic-content-from-websites-that-are-using-ajax) – Harrison Jun 19 '16 at 08:57
