
I want to know how Scrapy filters crawled URLs. Does it store all the URLs it has crawled in something like a crawled_urls_list, and when it gets a new URL, look it up in that list to check whether it already exists?

Where is the code for this filtering part of CrawlSpider (/path/to/scrapy/contrib/spiders/crawl.py)?

Thanks a lot!

JavaNoScript

1 Answer


By default, Scrapy keeps a fingerprint of every request it has seen. The fingerprints are held in memory in a Python set and also appended to a file called requests.seen in the directory defined by the JOBDIR setting. If you restart Scrapy, that file is reloaded into the Python set. The class that controls this is the dupe filter in scrapy.dupefilter (RFPDupeFilter). You can override this class if you need different behaviour.
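For illustration, here is a minimal sketch of what overriding the dupe filter might look like. It assumes the older module layout where RFPDupeFilter lives in scrapy.dupefilter (newer Scrapy versions use scrapy.dupefilters); the VerboseDupeFilter name, the duplicates.log file, and the myproject.dupefilters module path are all hypothetical:

```python
import os

from scrapy.dupefilter import RFPDupeFilter   # scrapy.dupefilters in newer Scrapy versions
from scrapy.utils.request import request_fingerprint


class VerboseDupeFilter(RFPDupeFilter):
    """Hypothetical subclass: behaves like the default filter,
    but also records every duplicate URL it drops."""

    def request_seen(self, request):
        fp = request_fingerprint(request)
        if fp in self.fingerprints:
            # Already seen: log the duplicate and tell the scheduler to drop it.
            with open('duplicates.log', 'a') as log:
                log.write(request.url + '\n')
            return True
        # New request: remember its fingerprint in memory...
        self.fingerprints.add(fp)
        # ...and persist it to requests.seen if a JOBDIR was configured.
        if self.file:
            self.file.write(fp + os.linesep)
        return False
```

To plug it in, you would point the DUPEFILTER_CLASS setting at the subclass, e.g. `DUPEFILTER_CLASS = 'myproject.dupefilters.VerboseDupeFilter'` in settings.py, and set JOBDIR (e.g. `scrapy crawl myspider -s JOBDIR=crawls/myspider-1`) if you want the requests.seen file to be written and reloaded across runs.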

gvtech
  • Thanks a lot! Your answer has been a big help! What I want to do is pause the spider while crawling. Here is a link for other people who also want to pause their spiders: [Pause your spider](http://scrapy.readthedocs.org/en/0.16/topics/jobs.html) – JavaNoScript Nov 30 '12 at 09:28
  • Where can I find the file that contains the URLs that have been scraped please? – William Kinaan Jan 27 '14 at 15:52
  • @WilliamKinaan If you define a job directory through the `JOBDIR` setting, you can find a file named `requests.seen` under that directory. That is the file that contains the URLs that have been scraped. – JavaNoScript Feb 10 '14 at 11:41