
I wrote a basic CrawlSpider in Scrapy, but I want to understand the order in which the URLs are crawled: FIFO or LIFO?

I want the crawler to crawl all the links on the start URL's page first and only then move on to other URLs, which does not seem to be the order it follows.

How can I do this?

Siddharth

2 Answers


http://readthedocs.org/docs/scrapy/en/0.14/faq.html#does-scrapy-crawl-in-breath-first-or-depth-first-order

By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO (depth-first) order. This order is more convenient in most cases. If you do want to crawl in true BFO (breadth-first) order, you can do it by setting the following settings:

    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'
Acorn
  • Thanks. Where do I set these settings? In the crawler class or in the `scrapy.cfg` file? – Siddharth Dec 04 '11 at 22:32
  • Put them in the `settings.py` file in your project module – Acorn Dec 04 '11 at 22:34
  • Thanks. It worked. Along the same lines, how can I allow Scrapy to crawl links whose URL contains a specified regex and not the others? I would still need to go through all the URLs (that is, spider them) but extract data only from the ones containing a specific regex in their URL. – Siddharth Dec 04 '11 at 23:13
  • [`crawlspider example`](http://readthedocs.org/docs/scrapy/en/0.14/topics/spiders.html#crawlspider-example), [`link extractor documentation`](http://readthedocs.org/docs/scrapy/en/0.14/topics/link-extractors.html#topics-link-extractors) – Acorn Dec 04 '11 at 23:18
  • I don't think this will solve the problem. It will only follow links that match the regex. The problem is this: suppose page A has links to pages B1, B2, B3. Then B1 has links to pages C1, C2, D1, D2. Similarly, B2 has links to C3, C4, D2. I want to extract data from the pages which start with C (i.e. C1, C2, C3, C4). A regex in the rules will not follow B1, B2, B3 and so will never reach the 'C' pages. – Siddharth Dec 04 '11 at 23:22
  • You just need to define more than one rule: one that gets followed, and one with a callback for parsing. Look at the example I linked to (see the sketch after these comments). – Acorn Dec 04 '11 at 23:44
  • If the above code doesn't work, change `squeue` to `squeues`. – Adarsh Patel Sep 03 '20 at 20:29
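
To make the two-rule approach concrete, here is a minimal sketch for the A → B → C scenario described in the comments. The domain, URL patterns, and spider name are hypothetical placeholders, and the imports use modern paths (in Scrapy 0.14 the equivalents lived under `scrapy.contrib.spiders` and `scrapy.contrib.linkextractors.sgml`):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class CPageSpider(CrawlSpider):
        name = 'c_pages'
        allowed_domains = ['example.com']      # hypothetical domain
        start_urls = ['http://example.com/A']  # hypothetical start page

        rules = (
            # Rule 1: follow the intermediate 'B' pages without parsing them.
            Rule(LinkExtractor(allow=r'/B\d+'), follow=True),
            # Rule 2: parse only the 'C' pages via a callback
            # (a rule with a callback does not follow further links by default).
            Rule(LinkExtractor(allow=r'/C\d+'), callback='parse_c'),
        )

        def parse_c(self, response):
            # Extract whatever data the 'C' pages carry.
            yield {'url': response.url}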

You can add this to your `settings.py`:

    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

For reference, see the official documentation.
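
Putting both answers together, a minimal `settings.py` sketch might look like this (the project name is hypothetical; the `scrapy.squeues` module paths apply to newer Scrapy versions, as noted in the last comment under the accepted answer):

    # myproject/settings.py (project name is hypothetical)
    BOT_NAME = 'myproject'
    SPIDER_MODULES = ['myproject.spiders']

    # Switch the scheduler from the default LIFO (depth-first) queues
    # to FIFO (breadth-first) ones:
    DEPTH_PRIORITY = 1
    SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
    SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'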