
I'm a beginner and I'm stuck on a problem. I have 30 links to crawl, but the crawler should stop crawling the remaining links once a certain condition is met (break_flag == True). I put in a dummy condition, stop crawling when count == 2, but the spider always scrapes all 30 of the provided links. I raise a CloseSpider() exception, but it doesn't seem to have any effect; every provided link still gets scraped. Another problem I'm facing is that the spider crawls the links in a random order; I want them crawled in the sequence they are given.

My Spider

import scrapy
from datetime import datetime
from scrapy.exceptions import CloseSpider
# GoogleSheet and PropertiesLinkItem are imported from this project's own modules


class IkmanSpider(scrapy.Spider):
    name = 'ikman'
    allowed_domains = ['ikman.lk']
    start_urls = ['https://ikman.lk/en/ads/sri-lanka/property?page=' + str(i) for i in range(1, 30)]
    main_url = 'https://ikman.lk'
    # Difference between the current date and the date of the last scrape
    days_diff = GoogleSheet().duration_from_last_run()
    count = 0

    def parse(self, response):
        self.count += 1
        break_flag = False
        objs = list()
        links = set()
        boxes = response.css('.list--3NxGO li')
        for box in boxes:
            l = box.css('a::attr(href)')[0].extract()
            try:
                time = box.css('.updated-time--1DbCk::text')[0].extract()
                print('time: ', time)
                if 'day' in time:
                    day = int(str(time).split(' ')[0].strip())
                    print('Posted day:', day)
                    if self.days_diff <= day:
                        break_flag = True
                        continue
            except:
                pass
            l = self.main_url + l
            if l not in links:
                obj = PropertiesLinkItem()
                obj['link'] = l
                obj["status"] = '0'
                # scraping Date
                obj['s_date'] = str(datetime.now().day) + '-' + str(datetime.now().month) + '-' + str(
                    datetime.now().year)
                objs.append(obj)
                links.add(l)
        if break_flag or self.count == 2:
            print("Stop Scraping")
            raise CloseSpider('All newly added Links has been Scrapped')
        yield {'data': objs}
Dosti
  • Be careful about using a bare `except:` like that, see https://stackoverflow.com/questions/54948548/what-is-wrong-with-using-a-bare-except. – AMC Mar 14 '20 at 16:25
  • Thanks for your suggestion, but it wouldn't solve my problem. – Dosti Mar 14 '20 at 16:27

3 Answers


If you don't want to scrape all 30 start_urls, you have to change the architecture of the spider slightly: chain the requests from one page to the next by adding a pagination parameter, and yield the next page's request only if your stop condition has not been met yet. That's the usual way to do it.
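
Here is a minimal sketch of that pattern, using the URL scheme and selector from the question (the item-building code is elided, and attribute names like base_url and max_pages are just illustrative). Because each page's request is only yielded from the previous page's callback, the pages are also crawled strictly in order:

import scrapy


class IkmanSpider(scrapy.Spider):
    name = 'ikman'
    allowed_domains = ['ikman.lk']
    base_url = 'https://ikman.lk/en/ads/sri-lanka/property?page={}'
    max_pages = 30

    def start_requests(self):
        # Only request the first page here; later pages are chained from parse().
        yield scrapy.Request(self.base_url.format(1), meta={'page': 1})

    def parse(self, response):
        page = response.meta['page']
        break_flag = False
        for box in response.css('.list--3NxGO li'):
            # ... build and yield your items here, and set break_flag = True
            #     when you hit an ad older than the last run ...
            pass

        # Yield the next page only while the stop condition has not been met,
        # so no further requests are scheduled once it triggers.
        if not break_flag and page < self.max_pages:
            yield scrapy.Request(self.base_url.format(page + 1),
                                 meta={'page': page + 1})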

Otherwise you can use this hacky trick: How do I stop all spiders and the engine immediately after a condition in a pipeline is met?

But that will require more manipulation.

Michael Savchenko

You can try this one:

COUNT_MAX = 5

custom_settings = {
    'CLOSESPIDER_PAGECOUNT': COUNT_MAX
}

It worked for me.
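
For context, CLOSESPIDER_PAGECOUNT is handled by Scrapy's built-in CloseSpider extension: once roughly that many responses have been crawled, the spider is closed gracefully (requests already in flight may still finish). A minimal sketch of where the setting lives, assuming the spider class from the question:

import scrapy


class IkmanSpider(scrapy.Spider):
    name = 'ikman'
    # Ask Scrapy to close the spider after about 5 crawled responses.
    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 5,
    }

    def parse(self, response):
        ...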

Protik Nag

I think this happens because the `if break_flag or self.count == 2` check is outside the `for box in boxes` loop.

So your program runs through all of the boxes and only then checks whether it should stop. The fix is simple: just move the `if break_flag or self.count == 2` check inside the for loop.
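
A sketch of that suggestion applied to the question's parse() method (an excerpt only; the per-box code that builds the items and sets break_flag is unchanged and elided here):

    def parse(self, response):
        self.count += 1
        break_flag = False
        objs = list()
        for box in response.css('.list--3NxGO li'):
            # ... existing code that builds items and may set break_flag ...
            if break_flag or self.count == 2:
                print("Stop Scraping")
                raise CloseSpider('All newly added Links has been Scrapped')
        yield {'data': objs}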

  • No, that part is fine. I want to loop through all the boxes, but I don't want it to make further crawl requests once a certain condition is met. Just have a look at the start_urls; there are 30 of them. – Dosti Mar 14 '20 at 16:24
  • It also shows the "All newly added Links has been Scrapped" message in the log, which means that piece of code runs, but it is always executed only after all 30 pages have been crawled. – Dosti Mar 14 '20 at 16:29