My question is very similar to this one, but not exactly the same. I am trying to crawl webpage A, which contains items I want to scrape plus a list of links to webpages B1, B2, B3, ... Each B page contains items I want to scrape and a link to another page (C1, C2, C3, ...). Each C page also contains items I want to scrape.
Here's what my spider looks like for now:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider
from scrapy_splash import SplashRequest

from ..items import XiaomiItem


class XiaomiSpider(CrawlSpider):
    name = 'xiaomi'
    allowed_domains = ['mi.com']
    start_urls = ['http://app.mi.com']

    # parse the A page: follow every link to a B page
    def parse(self, response):
        le = LinkExtractor()
        for link in le.extract_links(response):
            yield SplashRequest(
                link.url,
                callback=self.parse_2,
                endpoint='render.html',
                args={'wait': 0.5},
            )

    # parse B pages: scrape the items, then follow links to C pages
    def parse_2(self, response):
        for sel in response.xpath('//ul[@class="applist"]/li/h5/a'):
            item = XiaomiItem()
            item['name'] = sel.xpath('text()').extract()
            yield item
        le = LinkExtractor()
        for link in le.extract_links(response):
            yield SplashRequest(
                link.url,
                callback=self.parse_dir_contents,
                endpoint='render.html',
                args={'wait': 0.5},
            )

    # parse a C page and save the items
    def parse_dir_contents(self, response):
        for sel in response.xpath('//ul[@class="applist"]/li/h5/a'):
            item = XiaomiItem()
            item['name'] = sel.xpath('text()').extract()
            yield item
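For reference, XiaomiItem is minimal; roughly, items.py only declares the name field the spider fills in:

import scrapy

class XiaomiItem(scrapy.Item):
    # the only field populated by the spider above
    name = scrapy.Field()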
The issue is that the crawler never follows the links to the C pages. I suspect the problem is in the parse_2 function, but I don't know how to fix it. Can anyone show me what the problem is?
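In case the setup matters, scrapy-splash is enabled in settings.py following its README (a sketch, assuming Splash is running at the default localhost:8050):

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'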