
I am using Scrapy version 1.5.1. I created a parser that parses URLs from the main page, then parses URLs from the already-parsed pages, and so on. Scrapy works asynchronously and makes parallel connections. The problem is that I have some logic determining which URLs should be parsed first, keeping a set of URLs I have already visited, a maximum number of URLs to visit, etc.

First, I set CONCURRENT_REQUESTS_PER_DOMAIN=1 and CONCURRENT_REQUESTS=1, but that did not help, because I think the scheduler queues the URLs that will be processed next and then performs them in a different order.
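For reference, a minimal sketch of that configuration applied through a spider's `custom_settings`; the spider class, name, and start URL here are hypothetical placeholders:

```python
import scrapy

class SequentialSpider(scrapy.Spider):
    name = "sequential"  # hypothetical spider name
    start_urls = ["https://example.com"]  # placeholder start URL

    # The settings described above, applied per spider:
    # only one request is downloaded at a time.
    custom_settings = {
        "CONCURRENT_REQUESTS": 1,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    }

    def parse(self, response):
        # Placeholder callback.
        pass
```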

What I need is to force Scrapy to process one URL, wait until it is finished, and only then start parsing the next URL, and so on. Is there a way to configure Scrapy to do this?

dorinand
  • This is a very common question and what you are looking for is called request chaining. e.g. https://stackoverflow.com/questions/38753743 or https://stackoverflow.com/questions/41660978 – Granitosaurus Oct 28 '18 at 06:38
  • Possible duplicate of [Scrapy merge subsite-item with site-item](https://stackoverflow.com/questions/38753743/scrapy-merge-subsite-item-with-site-item) – Granitosaurus Oct 28 '18 at 06:40
  • Well, I have an item, an object consisting of functions to parse and store URLs. I pass this object over the request's meta. Right now, I process 3 domains. The output JSON consists of 3 domains, but the visited pages are mixed: domain 1's visited URLs contain URLs from all 3 domains, and the same goes for domains 2 and 3. I need to process the first domain, then the second domain, and then the third domain, etc... – dorinand Nov 05 '18 at 19:47
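For context, a rough sketch of the meta-passing pattern described in the comment above; the spider, callback names, and state object are hypothetical:

```python
import scrapy

class DomainSpider(scrapy.Spider):
    name = "domains"  # hypothetical spider name
    start_urls = ["https://domain1.example"]  # placeholder

    def parse(self, response):
        # Placeholder state object holding visited URLs and a visit limit.
        state = {"visited": set(), "max_urls": 100}
        for href in response.css("a::attr(href)").extract():
            # Attach the object to the request so the next callback sees it.
            yield response.follow(href, callback=self.parse_page,
                                  meta={"state": state})

    def parse_page(self, response):
        # The same object travels with every request spawned from this page.
        state = response.meta["state"]
        state["visited"].add(response.url)
        if len(state["visited"]) < state["max_urls"]:
            for href in response.css("a::attr(href)").extract():
                yield response.follow(href, callback=self.parse_page,
                                      meta={"state": state})
```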

1 Answer


Try using yield response.follow instead of yield Request: https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.TextResponse.follow
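A minimal sketch of that suggestion, assuming a hypothetical spider callback that follows every link on the page:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"  # hypothetical spider name
    start_urls = ["https://example.com"]  # placeholder

    def parse(self, response):
        # response.follow builds the Request from the current response,
        # so relative URLs and link selectors can be passed directly.
        for link in response.css("a::attr(href)"):
            yield response.follow(link, callback=self.parse)
```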

vezunchik