
I am using scrapy to crawl all the web pages under a domain.

I have seen this question, but there is no solution there. My problem seems to be a similar one. The output of my crawl command looks like this:

scrapy crawl sjsu
2012-02-22 19:41:35-0800 [scrapy] INFO: Scrapy 0.14.1 started (bot: sjsucrawler)
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Enabled item pipelines: 
2012-02-22 19:41:35-0800 [sjsu] INFO: Spider opened
2012-02-22 19:41:35-0800 [sjsu] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-02-22 19:41:35-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-02-22 19:41:35-0800 [sjsu] DEBUG: Crawled (200) <GET http://cs.sjsu.edu/> (referer: None)
2012-02-22 19:41:35-0800 [sjsu] INFO: Closing spider (finished)
2012-02-22 19:41:35-0800 [sjsu] INFO: Dumping spider stats:
    {'downloader/request_bytes': 198,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 11000,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2012, 2, 23, 3, 41, 35, 788155),
     'scheduler/memory_enqueued': 1,
     'start_time': datetime.datetime(2012, 2, 23, 3, 41, 35, 379951)}
2012-02-22 19:41:35-0800 [sjsu] INFO: Spider closed (finished)
2012-02-22 19:41:35-0800 [scrapy] INFO: Dumping global stats:
    {'memusage/max': 29663232, 'memusage/startup': 29663232}

The problem here is that the crawler finds links on the first page but does not visit them. What's the use of such a crawler?

EDIT:

My crawler code is:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class SjsuSpider(BaseSpider):
    name = "sjsu"
    allowed_domains = ["sjsu.edu"]
    start_urls = [
        "http://cs.sjsu.edu/"
    ]

    def parse(self, response):
        filename = "sjsupages"
        open(filename, 'wb').write(response.body)

All of my other settings are default.

hrishikeshp19

3 Answers


I think the best way to do this is by using a CrawlSpider. Modify your code as follows so that it finds all the links on the first page and visits them:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class SjsuSpider(CrawlSpider):

    name = 'sjsu'
    allowed_domains = ['sjsu.edu']
    start_urls = ['http://cs.sjsu.edu/']
    # allow=() is used to match all links
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item')]

    def parse_item(self, response):
        x = HtmlXPathSelector(response)  # selector, in case you want to extract data from the page

        filename = "sjsupages"
        # open the file in append-binary mode so every crawled page is added to it
        open(filename, 'ab').write(response.body)

If you want to crawl all the links on the website (and not only those on the first level), you have to add a rule that follows every link, so change the rules variable to this:

rules = [
    Rule(SgmlLinkExtractor(allow=()), follow=True),
    Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
]
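
Alternatively, a single rule can both follow links and fire the callback; a minimal sketch, assuming the same Scrapy 0.14 SgmlLinkExtractor (the allow=('.+',) pattern is the one discussed in the comments below):

# one rule: follow every extracted link and run parse_item on each response
rules = [
    Rule(SgmlLinkExtractor(allow=('.+',)), callback='parse_item', follow=True)
]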

I have changed your 'parse' callback to 'parse_item' because of this:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

For more information you can see this: http://doc.scrapy.org/en/0.14/topics/spiders.html#crawlspider

Thanasis Petsas
  • try to change callback='parse_item' to callback=('parse_item'), or change the rules to this: rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback=('parse_item'))] – Thanasis Petsas Feb 28 '12 at 11:25
  • Moreover, it's a good idea to open the file before the declaration of the crawler and just call the write method in parse_item. You can add this before the crawler class: myfile = open(filename, 'ab'), and inside parse_item you can use: myfile.write(response.body). After the write call you can use flush to force your program to flush the data to the file: myfile.flush() (see the sketch after these comments). – Thanasis Petsas Feb 28 '12 at 12:34
  • allow=() will not work because the internals of the Rule work on a matching basis, i.e. if you want all URLs to be processed, simply put Rule(SgmlLinkExtractor(allow=('.+',)), callback='parse_item'). Don't forget the comma, as it requires a tuple. – goh Mar 02 '12 at 10:39
  • @goh, to avoid missing the subtle comma/tuple, scrapy code examples use lists, e.g. `Rule(SgmlLinkExtractor(allow=['.+']), callback='parse_item')`. Also I can confirm that the second Rule in @ThanasisPetsas's answer does not call the callback; the callback must be set in the Rule that has `follow=True`, as you have done, @goh. But I'm hesitant to edit the erroneous code in the answer, as I'm new to `Scrapy`. – hobs Jan 10 '14 at 19:31
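
A minimal sketch of that file-handling suggestion from the comments, assuming Scrapy 0.14 and the combined rule from the sketch above (the names myfile and sjsupages follow the comment and the question):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# open the output file once, before the spider class, rather than on every response
myfile = open('sjsupages', 'ab')

class SjsuSpider(CrawlSpider):
    name = 'sjsu'
    allowed_domains = ['sjsu.edu']
    start_urls = ['http://cs.sjsu.edu/']
    rules = [Rule(SgmlLinkExtractor(allow=('.+',)), callback='parse_item', follow=True)]

    def parse_item(self, response):
        myfile.write(response.body)
        myfile.flush()  # force the buffered data out to the file after each page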

If you are using BaseSpider, then in the parse callback you need to extract your desired URLs yourself and return Request objects if you intend to visit those URLs, for example:

hxs = HtmlXPathSelector(response)
for url in hxs.select('//a/@href').extract():
    yield Request(url, callback=self.parse)

The parse callback just hands you the response; you have to say what you want to do with it. It's stated in the docs.
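
A minimal end-to-end sketch of that approach, assuming Scrapy 0.14 (the urljoin call is an addition here to turn relative hrefs into absolute URLs; it is not in the snippet above):

from urlparse import urljoin

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

class SjsuSpider(BaseSpider):
    name = 'sjsu'
    allowed_domains = ['sjsu.edu']
    start_urls = ['http://cs.sjsu.edu/']

    def parse(self, response):
        # save the page, then queue every link found on it
        open('sjsupages', 'ab').write(response.body)
        hxs = HtmlXPathSelector(response)
        for href in hxs.select('//a/@href').extract():
            # hrefs may be relative, so join them against the current URL
            yield Request(urljoin(response.url, href), callback=self.parse)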

Or, if you wish to use a CrawlSpider, you simply define rules for your spider instead.

goh
  • Thanks. I can see that in the docs, but the tutorial does not say anything about it, and I assumed it would crawl all web pages down the tree under the allowed domain. I have not tested it yet, but I will definitely let you know. Thanks again, BTW. – hrishikeshp19 Feb 23 '12 at 08:04

Just in case this is useful: when the crawler does not work, as in this case, make sure you delete the following code from your spider file, because the spider calls this method by default if it is declared in the file.

def parse(self, response):
  pass
narko