
I have a Scrapy spider, but sometimes it doesn't return requests.

I found this by adding log messages before yielding a request and after receiving a response.

The spider iterates over pages and, on each page, parses a link for item scraping.

Here is part of the code:

from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from scrapy import log


class SampleSpider(BaseSpider):
    ....
    def parse_page(self, response):
        ...
        request = Request(target_link, callback=self.parse_item_general)
        request.meta['date_updated'] = date_updated
        self.log('parse_item_general_send {url}'.format(url=request.url), level=log.INFO)
        yield request

    def parse_item_general(self, response):
        self.log('parse_item_general_recv {url}'.format(url=response.url), level=log.INFO)
        sel = Selector(response)
        ...

I've compared the number of occurrences of each log message, and there are more "parse_item_general_send" messages than "parse_item_general_recv" messages.
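
A minimal sketch of such a comparison, assuming the spider's output is written to a log file named spider.log (a hypothetical path):

from collections import Counter

# Count the two markers; a gap between 'send' and 'recv' means some
# responses never reached the callback.
counts = Counter()
with open('spider.log') as log_file:
    for line in log_file:
        if 'parse_item_general_send' in line:
            counts['send'] += 1
        elif 'parse_item_general_recv' in line:
            counts['recv'] += 1

print('send=%d recv=%d' % (counts['send'], counts['recv']))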

There are no 400 or 500 errors in the final statistics; every response status code is 200. It looks like the requests just disappear.

I've also added these parameters to minimize possible errors:

CONCURRENT_REQUESTS_PER_DOMAIN = 1
DOWNLOAD_DELAY = 0.8

Because of the asynchronous nature of Twisted, I don't know how to debug this. I've found a similar question: Python Scrapy not always downloading data from website, but it has no answers.
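
One way to make disappearing requests visible is to log every request the scheduler rejects. This is only a sketch: it assumes a recent Scrapy version that provides the request_dropped signal, and the spider below is just a placeholder standing in for the real one.

import scrapy
from scrapy import signals


class SampleSpider(scrapy.Spider):
    name = 'sample'  # placeholder

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SampleSpider, cls).from_crawler(crawler, *args, **kwargs)
        # request_dropped fires when the scheduler rejects a request,
        # e.g. because the duplicate filter has already seen its URL.
        crawler.signals.connect(spider.on_request_dropped,
                                signal=signals.request_dropped)
        return spider

    def on_request_dropped(self, request, spider):
        self.logger.info('Dropped by scheduler: %s', request.url)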

  • Try disabling the offsite middleware to see what happens. – R. Max Dec 22 '13 at 02:57
  • I've tried it (based on [this example](http://doc.scrapy.org/en/latest/topics/spider-middleware.html)), but nothing has changed: some requests are still lost. Between 2 and 5 out of about 120 requests always disappear. – Nikolai Golub Dec 22 '13 at 06:05
  • Could you provide a minimal example that reproduces this issue? Otherwise it will be hard to point out what's wrong, as this is not a common issue. – R. Max Dec 22 '13 at 14:39
  • Alternatively, try adding `dont_filter=True` to your `Request` objects; duplicate requests are usually filtered out without any notice. It may be that your requests get redirected to an already-visited URL and thus get filtered (see the sketch after these comments). – R. Max Dec 22 '13 at 14:41
  • I've tried to create a short demo script, and it works without errors. So, as expected, the error is somewhere in the spider code; I probably use yield incorrectly inside conditions. I will update the question when I find out the root cause. – Nikolai Golub Dec 25 '13 at 16:14
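
A minimal sketch of the `dont_filter=True` suggestion from the comments above, applied to the request from the question; the rest of the spider is assumed unchanged:

    def parse_page(self, response):
        ...
        # dont_filter=True bypasses the scheduler's duplicate filter, so the
        # request is kept even if its URL (or a redirect target) was seen before.
        request = Request(target_link,
                          callback=self.parse_item_general,
                          dont_filter=True)
        request.meta['date_updated'] = date_updated
        yield request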

1 Answer


On the same note as Rho, you can add the setting

DUPEFILTER_CLASS = 'scrapy.dupefilter.BaseDupeFilter' 

to your "settings.py" which will remove the url caching. This is a tricky issue since there isn't a debug string in the scrapy logs that tells you when it uses a cached result.
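
As a sketch of where this goes, plus a softer alternative (DUPEFILTER_DEBUG exists only in newer Scrapy versions, so treat that line as an assumption about your version):

# settings.py (sketch)

# Turn the duplicate filter off entirely, as suggested above.
DUPEFILTER_CLASS = 'scrapy.dupefilter.BaseDupeFilter'

# Alternative for newer Scrapy versions: keep the filter, but log every
# duplicate request it drops instead of only the first one.
# DUPEFILTER_DEBUG = True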

  • I was having the same issue. Somehow, I was always losing 30 requests, and always the same requests. After setting this option in my settings.py file, everything worked just fine. – arthursfreire Apr 24 '17 at 20:13