I have a very specific situation in a scraper I am developing right now. The first function, parse_posts_pages, iterates over all the pages of a forum thread and, for each page, schedules the second function, parse_posts, as the callback.
def parse_posts_pages(self, response):
    thread_id = response.meta['thread_id']
    thread_link = response.meta['thread_link']
    thread_name = response.meta['thread_name']
    pages = 1  # default when the page stats cannot be parsed
    stats = response.xpath('//*[@id="postpagestats_above"]/text()').re(r'(\d+)')
    if len(stats) == 3:
        posts_per_page = int(stats[1])
        total_posts = int(stats[2])
        if posts_per_page > 0:
            # integer division, rounding up for a partial last page
            pages = total_posts // posts_per_page
            if total_posts % posts_per_page > 0:
                pages += 1
    for page in range(pages, 0, -1):
        cur_page = '' if page == 1 else '/page' + str(page)
        post_page_link = thread_link + cur_page
        yield scrapy.Request(post_page_link, self.parse_posts,
                             meta={'thread_id': thread_id, 'thread_name': thread_name})
def parse_posts(self, response):
    global maxPostIDByThread, executeFullSpider
    thread_id = response.meta['thread_id']
    thread_name = response.meta['thread_name']
    for post in response.xpath('//*[@id="posts"]/li'):
        post_id = post.xpath('@id').re(r'(\d.*)')[0]
        if not executeFullSpider and post_id in maxPostIDByThread:
            break  # <- I need this break to also cancel the for in parse_posts_pages
    ...
In the second function there is an if condition. When this condition evaluates to true, I need to break out of the current for loop AND also out of the for loop in parse_posts_pages, since there is no need to continue paginating.
Is there any way to stop the for loop in the first function from the second function?
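One approach I have been considering (a sketch, not a confirmed solution): since both functions are methods on the same spider instance, they can share an instance attribute that acts as a stop flag. parse_posts raises the flag instead of trying to break the other loop directly, and parse_posts_pages checks it before emitting each request. Note a caveat for real Scrapy: callbacks run asynchronously, so by the time parse_posts sets the flag, the pagination generator may already have yielded all its requests; in practice this pattern is usually combined with chaining (yielding the next page request from parse_posts itself). The sketch below uses plain synchronous Python with hypothetical names (PaginationStopper, stop_pagination, fetch) just to illustrate the flag mechanics:

```python
class PaginationStopper:
    """Minimal stand-in for the spider (no Scrapy) to show the flag pattern."""

    def __init__(self, known_post_ids):
        self.known_post_ids = set(known_post_ids)
        self.stop_pagination = False  # shared flag (hypothetical name)

    def fetch(self, page):
        # Pretend each page holds two posts with predictable ids.
        return [f"post{2 * page}", f"post{2 * page + 1}"]

    def parse_posts_pages(self, page_numbers):
        # Stand-in for the request-emitting loop: stop as soon as the
        # flag has been raised by parse_posts.
        for page in page_numbers:
            if self.stop_pagination:
                break
            yield from self.parse_posts(self.fetch(page))

    def parse_posts(self, posts):
        for post_id in posts:
            if post_id in self.known_post_ids:
                self.stop_pagination = True  # raise the flag ...
                break                        # ... and leave this loop
            yield post_id


stopper = PaginationStopper(known_post_ids={"post5"})
# Pages 1-2 yield post2, post3, post4; post5 is already known,
# so the flag is raised and pages 3-4 are never fetched.
scraped = list(stopper.parse_posts_pages(page_numbers=[1, 2, 3, 4]))
```

The key design point is that break only ever exits the loop it appears in; communicating across the two loops has to go through shared state (or an exception), which is why the flag lives on the instance rather than in either function's local scope.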