
I am using Scrapy to crawl data. The target website blocks my IP after roughly 1,000 requests.

To deal with this, I wrote a proxy middleware, and because the amount of data is relatively large, I also wrote a cache extension. With both of them enabled, I get banned more often; with only the proxy middleware enabled, everything works well.

I know that when the Scrapy engine starts, extensions are initialized before middlewares. Could this be the reason? If not, what else should I consider?

Any suggestions will be appreciated!
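
For reference, this is a minimal sketch of the kind of rotating-proxy downloader middleware described above, not my exact code; the class name, proxy URLs, and settings priority are placeholders:

```python
import random

class RandomProxyMiddleware:
    """Downloader middleware that routes each request through a random proxy."""

    # Hypothetical proxy pool; in practice it would be loaded from settings or a file.
    PROXIES = [
        "http://proxy1.example.com:8080",
        "http://proxy2.example.com:8080",
    ]

    def process_request(self, request, spider):
        # Attach a randomly chosen proxy to every outgoing request.
        request.meta["proxy"] = random.choice(self.PROXIES)
```

Enabled in `settings.py` with something like:

```python
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RandomProxyMiddleware": 350,
}
```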

  • Welcome to Stack Overflow! Could you clarify which extensions and middlewares you are using? The question suggests both are custom-made, did I understand that correctly? – Nikolay Shebanov Feb 01 '21 at 10:56
  • @NikolayShebanov Thank you for helping me edit the page. The extensions and middlewares I mentioned are both custom-made. I found that the cause of the problem was that my extension stored every response, even those whose status code was not 200. I have modified the code and it works well now (see the sketch after these comments). Thank you again. – Sherwin Feb 02 '21 at 04:06
  • Glad to hear that! It might make sense to update the question with a snippet of your code and maybe provide a short write-up of the answer. That would help others who get into a similar situation. – Nikolay Shebanov Feb 02 '21 at 09:15
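
Based on Sherwin's follow-up comment, the fix was to stop caching non-200 responses. A minimal sketch of what that guard could look like in a custom cache extension, assuming the extension listens on Scrapy's `response_received` signal; the class name and `store_response` persistence helper are hypothetical:

```python
from scrapy import signals

class ResponseCacheExtension:
    """Extension that stores responses as they are received."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.response_received,
                                signal=signals.response_received)
        return ext

    def response_received(self, response, request, spider):
        # Skip non-200 responses: caching ban or error pages (e.g. 403/503)
        # means they get replayed from the cache later instead of being
        # retried through a fresh proxy, which looks like getting banned more often.
        if response.status != 200:
            return
        self.store_response(request, response)

    def store_response(self, request, response):
        # Hypothetical persistence layer (disk, database, ...).
        ...
```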

0 Answers