I have a website where my crawler needs to follow a specific sequence: for example, it needs to crawl a1, b1, c1 before it starts on a2, and so on. Each of a, b and c is handled by a different parse function, and the corresponding URLs are wrapped in Request objects and yielded. The following roughly illustrates the code I'm using:
from scrapy.http import Request
from scrapy.spider import BaseSpider

class aspider(BaseSpider):
    def parse(self, response):
        # b is the next URL in the sequence for this item
        yield Request(b, callback=self.parse_b, priority=10)

    def parse_b(self, response):
        yield Request(c, callback=self.parse_c, priority=20)

    def parse_c(self, response):
        final_function()
However, I find that the actual crawl order seems to be a1, a2, a3, b1, b2, b3, c1, c2, c3, which is strange, since I thought Scrapy was supposed to crawl depth-first by default.
The sequence doesn't have to be strict, but the site I'm scraping has a limit in place, so Scrapy needs to start scraping level c as soon as it can, before five of the level-b pages have been crawled. How can this be achieved?
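One idea I've been considering is to chain the requests explicitly instead of relying on priorities: only yield the next a-level URL from parse_c, carrying the remaining list along in request.meta. Below is a rough sketch of what I mean; a_urls, build_b_url and build_c_url are placeholders for my real URL list and URL builders, not code I actually have. Is something like this the right approach, or is there a better way?

from scrapy.http import Request
from scrapy.spider import BaseSpider

class ChainedSpider(BaseSpider):
    name = "chained"
    # Start with only the first a-level URL; the rest are scheduled later.
    start_urls = a_urls[:1]

    def parse(self, response):
        # Remember which a-level URLs are still pending.
        remaining = response.meta.get('remaining', a_urls[1:])
        yield Request(build_b_url(response), callback=self.parse_b,
                      meta={'remaining': remaining})

    def parse_b(self, response):
        yield Request(build_c_url(response), callback=self.parse_c,
                      meta=response.meta)

    def parse_c(self, response):
        final_function()
        remaining = response.meta['remaining']
        if remaining:
            # Only now move on to the next a-level page.
            yield Request(remaining[0], callback=self.parse,
                          meta={'remaining': remaining[1:]})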