4

I am trying to crawl sites in a very basic manner, but Scrapy isn't crawling all the links. The scenario is as follows:

main_page.html -> contains links to a_page.html, b_page.html, c_page.html
a_page.html -> contains links to a1_page.html, a2_page.html
b_page.html -> contains links to b1_page.html, b2_page.html
c_page.html -> contains links to c1_page.html, c2_page.html
a1_page.html -> contains link to b_page.html
a2_page.html -> contains link to c_page.html
b1_page.html -> contains link to a_page.html
b2_page.html -> contains link to c_page.html
c1_page.html -> contains link to a_page.html
c2_page.html -> contains link to main_page.html

I am using the following rule in my CrawlSpider:

Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)
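
For reference, a minimal spider along these lines would reproduce the setup (the spider name, allowed domain and start URL here are assumptions taken from the crawl log below, and the import paths are the old-style ones from Scrapy releases that still ship SgmlLinkExtractor):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TestSpider(CrawlSpider):
    # spider name and URLs are assumptions based on the log output below
    name = 'test_spider'
    allowed_domains = ['localhost']
    start_urls = ['http://localhost/main_page.html']

    rules = (
        # follow every extracted link and pass each response to parse_item
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # placeholder callback: just log which page was parsed
        self.log('Parsed %s' % response.url)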

But the crawl results are as follows -

DEBUG: Crawled (200) <http://localhost/main_page.html> (referer: None)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/a_page.html> (referer: http://localhost/main_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/a1_page.html> (referer: http://localhost/a_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/b_page.html> (referer: http://localhost/a1_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/b1_page.html> (referer: http://localhost/b_page.html)
2011-12-05 09:56:07+0530 [test_spider] INFO: Closing spider (finished)

It is not crawling all the pages: a2_page, b2_page, c_page, c1_page and c2_page are never visited.

NB: I have configured the crawl to run in BFO (breadth-first order), as described in the Scrapy docs.
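
For reference, BFO is typically switched on in settings.py roughly like this (setting names as documented in current Scrapy releases; older versions such as 0.14 may use different queue module paths, so treat this as a sketch rather than a drop-in):

# settings.py: crawl in breadth-first order (BFO)
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'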

What am I missing?

Siddharth

3 Answers

5

Scrapy will by default filter out all duplicate requests.

You can circumvent this with, for example:

yield Request(url="http://test.com", callback=self.callback, dont_filter=True)

dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.

Also see the Request object documentation
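
In a CrawlSpider, where requests come from the rules rather than being yielded by hand, one way to apply this is through the rule's process_request hook. A rough sketch follows (the make_unfiltered helper name is made up here, and the hook's exact signature differs between Scrapy versions: older releases pass only the request, newer ones also pass the response):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TestSpider(CrawlSpider):
    name = 'test_spider'
    start_urls = ['http://localhost/main_page.html']

    rules = (
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item',
             follow=True, process_request='make_unfiltered'),
    )

    def make_unfiltered(self, request):
        # re-issue every extracted request with the duplicate filter disabled
        return request.replace(dont_filter=True)

    def parse_item(self, response):
        self.log('Parsed %s' % response.url)

Be aware that this disables duplicate filtering for every followed link, so the spider will loop over the cycles in the link graph above unless you stop it some other way (for example, a depth limit).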

Sjaak Trekhaak
  • Yes, I had read that. But as you can see, that is not the problem. The problem is that it is not crawling all the pages, not duplication. However, thanks for the answer; I was also looking into the duplication issue. – Siddharth Dec 06 '11 at 06:17
  • It's kinda hard understanding the path it went if I don't visualize it with some sort of web :-D - however I still have a feeling it's filtering out a duplicate page on a certain point. I'd still try the dont_filter option, just to be sure... I can't think of any other reason [without seeing the source html / spider] it wouldn't scrape the remaining pages. – Sjaak Trekhaak Dec 06 '11 at 10:50
  • Yes, I had tried setting the dont_filter option to False but it doesn't work that way either. – Siddharth Dec 06 '11 at 12:27
  • You should set the dont_filter kwarg to True (instead of False) to workaround default behaviour – Sjaak Trekhaak Dec 06 '11 at 15:01
3

I had a similar problem today, although I was using a custom spider. It turned out that the website was limiting my crawl because my user agent was scrappy-bot.

Try changing your user agent to that of a known browser and try again.

Another thing you might want to try is adding a delay. Some websites prevent scraping if the time between requests is too small. Try adding a DOWNLOAD_DELAY of 2 and see if that helps.

More information about DOWNLOAD_DELAY at http://doc.scrapy.org/en/0.14/topics/settings.html
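
To make both suggestions concrete, the corresponding settings.py entries would look roughly like this (the browser string below is just an example value):

# settings.py
# present the crawler as a regular browser instead of the default Scrapy agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0'

# wait 2 seconds between consecutive requests to the same website
DOWNLOAD_DELAY = 2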

CodeMonkeyB
  • Thanks. I am trying to add the download delay. But the site which I took as an example is running on localhost and consists of just simple links. – Siddharth Dec 05 '11 at 05:05
  • No, that didn't work. In the stats that Scrapy prints, I am getting something like this: 'request_depth_max': 5, – Siddharth Dec 05 '11 at 05:08
  • You might be using the DepthMiddleware http://readthedocs.org/docs/scrapy/en/latest/topics/spider-middleware.html Look at the DepthMiddleware section – CodeMonkeyB Dec 05 '11 at 05:11
  • Actually no, I am not using a DepthMiddleware (which I think is activated using DEPTH_LIMIT) in settings.py of the scrapy project – Siddharth Dec 05 '11 at 05:15
  • Can someone please help me with this. This is driving me crazy. I am using the latest code from Github. – Siddharth Dec 05 '11 at 07:14
0

Maybe a lot of the URLs are duplicates. Scrapy avoids duplicate requests, since re-crawling them is inefficient. From your explanation, since you use a follow-links rule, there are of course a lot of duplicates.

If you want to be sure and see the proof in the log, add this to your settings.py.

DUPEFILTER_DEBUG = True

And you'll see lines like this in the log:

2016-09-20 17:08:47 [scrapy] DEBUG: Filtered duplicate request: <http://www.example.org/example.html>

Aminah Nuraini