
I already posted about this last week: my bot always gets blocked around page 321. I changed the Scrapy settings, but I noticed that the pages from 321 to the end seem to contain no items.

I'd like to know how to skip the pages that generate errors. I tried this:

    next_pages = response.xpath("//div[@class='pgLightPrevNext']/a/@href").extract()  # attempt to access the content of the following pages
    for next in next_pages:
        absolute_url = self.base_url + next
        try:
            yield scrapy.Request(absolute_url, callback=self.parse_dir_contents)
        except:
            pass

But this had no effect. How can I skip those pages?

Thanks.

Kalamarico
  • What do you mean by "mistakes"? What do those pages contain? Simply do not yield a request if you don't want to scrape them – Umair Ayub Oct 22 '18 at 07:49
  • Hello Umair, the problem occurs around page 321 every time; I still get the same 503 or 504 code no matter what I change in the settings. There is no product information displayed on those pages through to the end. I would like to skip those pages without product items, which are throwing errors –  Oct 22 '18 at 07:59
  • The problem is not in the request, it's in the extraction of the information. Check the status code of the URL in the callback and skip it if the status code is 500, 504, etc. – Pavan Kumar T S Oct 22 '18 at 10:35
  • Follow this for checking the status code in the callback function: https://stackoverflow.com/a/9698718/7887883 – Pavan Kumar T S Oct 22 '18 at 11:10
  • I tried adding meta={"handle_httpstatus_list": [503, 504]} to the code above, but it doesn't seem to "skip" the errors; the spider closes –  Oct 22 '18 at 12:12
  • This is not directly related to your question, but there is a friendlier pattern for relative URLs that doesn't require maintaining a `base_url` attribute for every spider, e.g. in the context above `absolute_url = response.urljoin(next)`. See https://doc.scrapy.org/en/latest/topics/request-response.html?highlight=urljoin#scrapy.http.Response.urljoin – pwinz Oct 23 '18 at 00:30
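
Putting the comment suggestions together, a minimal sketch might look like this (the spider class, its name, and the item extraction are assumptions for illustration; only the XPath and the parse_dir_contents callback name come from the question):

    import scrapy

    class ProductsSpider(scrapy.Spider):
        name = "products"  # hypothetical spider name

        def parse(self, response):
            next_pages = response.xpath("//div[@class='pgLightPrevNext']/a/@href").extract()
            for next_page in next_pages:
                yield scrapy.Request(
                    # urljoin resolves the relative href against the current page,
                    # so no base_url attribute is needed
                    response.urljoin(next_page),
                    callback=self.parse_dir_contents,
                    # let 503/504 responses reach the callback instead of being
                    # dropped by the HttpError middleware
                    meta={"handle_httpstatus_list": [503, 504]},
                )

        def parse_dir_contents(self, response):
            if response.status != 200:
                # error page with no products: log it, skip it and carry on
                self.logger.info("Skipping %s (status %d)", response.url, response.status)
                return
            # ... extract the product items here as before ...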

2 Answers


You can return early from the callback if the number of items collected for a page is 0.
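
A sketch of that idea, assuming a callback along the lines of the question (the product selector and item fields are hypothetical):

    def parse_dir_contents(self, response):
        products = response.xpath("//div[@class='product']")  # hypothetical selector
        if len(products) == 0:
            # nothing to extract on this page: return instead of failing
            self.logger.info("No items on %s, skipping", response.url)
            return
        for product in products:
            yield {
                "name": product.xpath(".//h2/text()").extract_first(),
                "price": product.xpath(".//span[@class='price']/text()").extract_first(),
            }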

Apalala

In the callback where you fetch the data, check whether response.status == 200. If it is not 200, you can retry that URL from another function, using a retry counter kept under a certain limit. If the limit is exceeded, move on to the next product URL.

    try:
        if response.status == 404:
            # self.append() and the *_log_file attributes are this spider's own
            # logging helpers, not Scrapy APIs
            self.append(self.bad_log_file, response.url)
            self.append(self.fourohfour, response.url)

        elif response.status == 200:
            self.append(self.ok_log_file, response.url)
        else:
            self.append(self.bad_log_file, response.url)

    except Exception as e:
        self.log('[exception] : %s' % e)
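
The retry-with-a-limit part is not shown above; one way to sketch it (assuming error responses such as 503/504 actually reach the callback, e.g. via handle_httpstatus_list, and an arbitrary limit of 3 retries) is to carry a counter in request.meta:

    MAX_RETRIES = 3  # assumed limit

    def parse_dir_contents(self, response):
        if response.status != 200:
            retries = response.meta.get("retry_count", 0)
            if retries < self.MAX_RETRIES:
                # re-issue the same request with an incremented counter,
                # bypassing the duplicate filter
                new_meta = dict(response.request.meta, retry_count=retries + 1)
                yield response.request.replace(meta=new_meta, dont_filter=True)
            # otherwise give up on this URL and move on to the next one
            return
        # ... normal extraction for a 200 response ...

Note that Scrapy's built-in RetryMiddleware already retries 500/503/504 responses a couple of times by default (see the RETRY_TIMES and RETRY_HTTP_CODES settings), so a manual counter like this is mainly useful when that middleware is disabled or you want different behaviour.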
Agus Mathew