
I am using Scrapy version 1.5.1. I created a parser that parses URLs from the main page, then parses URLs from the already-parsed pages, and so on. Scrapy works asynchronously and makes parallel connections. The problem is that I have some logic determining which URLs should be parsed first, keeping a set of URLs I have already visited, a maximum number of URLs to visit, etc.

First, I set CONCURRENT_REQUESTS_PER_DOMAIN=1 and CONCURRENT_REQUESTS=1, but that did not help, because I think the scheduler queues the URLs that will be processed next and then performs them in a different order.
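For reference, a minimal sketch of that configuration applied through a spider's `custom_settings`; the spider class, name, and start URL here are hypothetical placeholders:

```python
import scrapy

class SequentialSpider(scrapy.Spider):
    name = "sequential"  # hypothetical spider name
    start_urls = ["https://example.com"]  # placeholder start URL

    # The settings described above, applied per spider:
    # only one request is downloaded at a time.
    custom_settings = {
        "CONCURRENT_REQUESTS": 1,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    }

    def parse(self, response):
        # Placeholder callback.
        pass
```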

What I need is to force Scrapy to process one URL, wait until it is finished, and only then start parsing the next URL, and so on. Is there a way to configure Scrapy to do this?

dorinand
  • This is a very common question and what you are looking for is called request chaining. e.g. https://stackoverflow.com/questions/38753743 or https://stackoverflow.com/questions/41660978 – Granitosaurus Oct 28 '18 at 06:38
  • Possible duplicate of [Scrapy merge subsite-item with site-item](https://stackoverflow.com/questions/38753743/scrapy-merge-subsite-item-with-site-item) – Granitosaurus Oct 28 '18 at 06:40
  • Well, I have an item, an object consisting of functions to parse and store URLs. I pass this object over the request's meta. Right now, I process 3 domains. The output JSON consists of 3 domains, but the visited pages are mixed: domain 1's visited URLs contain URLs from all 3 domains, and the same goes for domains 2 and 3. I need to process the first domain, then the second domain, and then the third domain, etc... – dorinand Nov 05 '18 at 19:47
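For context, a rough sketch of the meta-passing pattern described in the comment above; the spider, callback names, and state object are hypothetical:

```python
import scrapy

class DomainSpider(scrapy.Spider):
    name = "domains"  # hypothetical spider name
    start_urls = ["https://domain1.example"]  # placeholder

    def parse(self, response):
        # Placeholder state object holding visited URLs and a visit limit.
        state = {"visited": set(), "max_urls": 100}
        for href in response.css("a::attr(href)").extract():
            # Attach the object to the request so the next callback sees it.
            yield response.follow(href, callback=self.parse_page,
                                  meta={"state": state})

    def parse_page(self, response):
        # The same object travels with every request spawned from this page.
        state = response.meta["state"]
        state["visited"].add(response.url)
        if len(state["visited"]) < state["max_urls"]:
            for href in response.css("a::attr(href)").extract():
                yield response.follow(href, callback=self.parse_page,
                                      meta={"state": state})
```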

1 Answer


Try using yield response.follow instead of yield Request: https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse.follow
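A minimal sketch of that suggestion, assuming a hypothetical spider callback that follows every link on the page:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"  # hypothetical spider name
    start_urls = ["https://example.com"]  # placeholder

    def parse(self, response):
        # response.follow builds the Request from the current response,
        # so relative URLs and link selectors can be passed directly.
        for link in response.css("a::attr(href)"):
            yield response.follow(link, callback=self.parse)
```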

vezunchik