
I want to know how Scrapy filters crawled URLs. Does it store all the URLs it has crawled in something like a crawled_urls_list, and when it gets a new URL, look it up in that list to check whether it already exists?

Where is the code for this filtering part of CrawlSpider (/path/to/scrapy/contrib/spiders/crawl.py)?

Thanks a lot!

JavaNoScript

1 Answer


By default, Scrapy keeps a fingerprint of every request it has seen. The fingerprints are held in memory in a Python set and also appended to a file called requests.seen in the directory defined by the JOBDIR setting. If you restart Scrapy, that file is reloaded into the Python set. The class that controls this is the dupe filter in scrapy.dupefilter (RFPDupeFilter). You can override this class if you need different behaviour.
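For illustration, here is a minimal sketch of what overriding the dupe filter might look like. It assumes the older module layout where RFPDupeFilter lives in scrapy.dupefilter (newer Scrapy versions use scrapy.dupefilters); the VerboseDupeFilter name, the duplicates.log file, and the myproject.dupefilters module path are all hypothetical:

```python
import os

from scrapy.dupefilter import RFPDupeFilter   # scrapy.dupefilters in newer Scrapy versions
from scrapy.utils.request import request_fingerprint


class VerboseDupeFilter(RFPDupeFilter):
    """Hypothetical subclass: behaves like the default filter,
    but also records every duplicate URL it drops."""

    def request_seen(self, request):
        fp = request_fingerprint(request)
        if fp in self.fingerprints:
            # Already seen: log the duplicate and tell the scheduler to drop it.
            with open('duplicates.log', 'a') as log:
                log.write(request.url + '\n')
            return True
        # New request: remember its fingerprint in memory...
        self.fingerprints.add(fp)
        # ...and persist it to requests.seen if a JOBDIR was configured.
        if self.file:
            self.file.write(fp + os.linesep)
        return False
```

To plug it in, you would point the DUPEFILTER_CLASS setting at the subclass, e.g. `DUPEFILTER_CLASS = 'myproject.dupefilters.VerboseDupeFilter'` in settings.py, and set JOBDIR (e.g. `scrapy crawl myspider -s JOBDIR=crawls/myspider-1`) if you want the requests.seen file to be written and reloaded across runs.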

gvtech
  • Thanks a lot! Your answer has been a big help! What I want to do is pause the spider while crawling. Here is a link for other people who also want to pause their spiders: [Pause your spider](http://scrapy.readthedocs.org/en/0.16/topics/jobs.html) – JavaNoScript Nov 30 '12 at 09:28
  • Where can I find the file that contains the URLs that have been scraped please? – William Kinaan Jan 27 '14 at 15:52
  • @WilliamKinaan If you define a job directory through the `JOBDIR` setting, you can find a file named `requests.seen` under that directory. That is the file that contains the URLs that have been scraped. – JavaNoScript Feb 10 '14 at 11:41