
I'm writing a crawler in Python that crawls all pages in a given domain, as part of a domain-specific search engine. I'm using Django, Scrapy, and Celery to achieve this. The scenario is as follows:

I receive a domain name from the user and call the crawl task inside the view, passing the domain as an argument:

crawl.delay(domain)
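
For context, a minimal sketch of such a view might look like this (the view name and the "domain" POST field are just illustrative, not my actual code):

from django.http import HttpResponse
from .tasks import crawl

def start_crawl(request):
    # domain name submitted by the user, e.g. "example.com"
    domain = request.POST['domain']
    # queue the crawl asynchronously via Celery
    crawl.delay(domain)
    return HttpResponse("Crawl queued for %s" % domain)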

The task itself just calls a function that starts the crawling process:

from .crawler.crawl import run_spider
from celery import shared_task

@shared_task
def crawl(domain):
    return run_spider(domain) 

run_spider starts the crawling process, as in this SO answer, replacing MySpider with WebSpider.
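
Roughly, the pattern from that answer (adapted to WebSpider) looks like the sketch below. This is a reconstruction from memory using the pre-1.0 Scrapy API, so the exact details, especially the body of the if statement, may differ from the original answer:

from billiard import Process  # multiprocessing fork that Celery workers can use
from twisted.internet import reactor
from scrapy import project, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings

from .web_spider import WebSpider  # the spider defined in web_spider.py


class WebCrawlerScript(Process):
    def __init__(self, spider):
        Process.__init__(self)
        self.crawler = Crawler(get_project_settings())
        if not hasattr(project, 'crawler'):
            self.crawler.install()
            # In the original answer the reactor.stop connection also sits
            # inside this if block; see the update in the answer below.
            self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        self.crawler.configure()
        self.spider = spider

    def run(self):
        # Run the crawl in a child process so the Twisted reactor can be
        # started fresh for every task invocation.
        self.crawler.crawl(self.spider)
        self.crawler.start()
        reactor.run()


def run_spider(domain):
    spider = WebSpider(domain=domain)
    crawler = WebCrawlerScript(spider)
    crawler.start()
    crawler.join()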

WebSpider inherits from CrawlSpider, and for now I'm using it just to test functionality. The only rule defined takes an SgmlLinkExtractor instance and a callback function, parse_page, which simply extracts the response URL and the page title, populates a new DjangoItem (HTMLPageItem) with them, and saves it to the database (not the most efficient approach, I know).

from urlparse import urlparse
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from ..items import HTMLPageItem
from scrapy.selector import Selector
from scrapy.contrib.spiders import Rule, CrawlSpider

class WebSpider(CrawlSpider):
    name = "web"

    def __init__(self, **kw):
        super(WebSpider, self).__init__(**kw)
        url = kw.get('domain') or kw.get('url')
        if not (url.startswith('http://') or url.startswith('https://')):
            url = "http://%s/" % url
        self.url = url
        hostname = urlparse(url).hostname
        # remove a leading "www." prefix (lstrip strips characters, not a prefix)
        self.allowed_domains = [hostname[4:] if hostname.startswith('www.') else hostname]
        self.start_urls = [url]
        self.rules = [
            Rule(SgmlLinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True), callback='parse_page', follow=True)
        ]

    def parse_start_url(self, response):
        return self.parse_page(response)

    def parse_page(self, response):
        sel = Selector(response)
        item = HTMLPageItem()
        item['url'] = response.request.url
        # guard against pages without a <title> element
        title = sel.xpath('//title/text()').extract()
        item['title'] = title[0] if title else ''
        item.save()
        return item

The problem is that the crawler only crawls the start_urls and does not follow links (or call the callback function) when run through Celery in this scenario. However, calling run_spider through python manage.py shell works just fine!

Another problem is that Item Pipelines and logging are not working with Celery. This is making debugging much harder. I think these problems might be related.

    Consider writing a "hello world" program with Celery, and get logging to work. Not seeing what's going on makes things... difficult :) – johntellsall Jun 15 '14 at 21:41
  • Thanks! I got Celery logging working in no time. I was trying to log using Scrapy, and that wasn't working. And logging did help, I fixed the problem :), will post the answer now. – Karim Sonbol Jun 16 '14 at 11:00
  • Pipelines are still not working though – Karim Sonbol Jun 20 '14 at 07:42

1 Answer


So after inspecting Scrapy's code and enabling Celery logging by inserting these two lines in web_spider.py:

from celery.utils.log import get_task_logger

logger = get_task_logger(__name__)
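
Messages sent through this logger show up in the Celery worker's output. For example (an illustrative call, not a line from my original code):

        # e.g. at the top of parse_page, to confirm the callback is reached
        logger.info("parse_page called for %s", response.url)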

I was able to locate the problem. It lies in the initialization function of WebSpider:

super(WebSpider, self).__init__(**kw)

The __init__ function of the parent CrawlSpider calls the _compile_rules function, which, in short, copies the rules from self.rules to self._rules while making some changes. self._rules is what the spider actually uses when it checks for rules. Calling CrawlSpider's initialization before defining the rules therefore left self._rules empty, so no links were followed.

Moving the super(WebSpider, self).__init__(**kw) call to the last line of WebSpider's __init__ fixed the problem.
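
For clarity, the fixed __init__ looks like this (the same body as in the question, with only the super() call moved to the end):

    def __init__(self, **kw):
        url = kw.get('domain') or kw.get('url')
        if not (url.startswith('http://') or url.startswith('https://')):
            url = "http://%s/" % url
        self.url = url
        hostname = urlparse(url).hostname
        # remove a leading "www." prefix (lstrip strips characters, not a prefix)
        self.allowed_domains = [hostname[4:] if hostname.startswith('www.') else hostname]
        self.start_urls = [url]
        self.rules = [
            Rule(SgmlLinkExtractor(
                allow_domains=self.allowed_domains,
                unique=True), callback='parse_page', follow=True)
        ]
        # CrawlSpider.__init__ calls _compile_rules(), so it must run
        # after self.rules has been defined
        super(WebSpider, self).__init__(**kw)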

Update: There is a small mistake in the code from the previously mentioned SO answer; it causes the reactor to hang after the second call. The fix is simple: in WebCrawlerScript's __init__ method, move this line:

self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)

out of the if statement, as suggested in the comments there.
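
Assuming the structure sketched in the question above, the corrected __init__ of WebCrawlerScript then looks roughly like this:

    def __init__(self, spider):
        Process.__init__(self)
        self.crawler = Crawler(get_project_settings())
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        # connected unconditionally now, so every crawl stops the reactor
        # when its spider closes (previously this line sat inside the if)
        self.crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        self.spider = spider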

Update 2: I finally got the pipelines to work! It was not a Celery problem: the Scrapy settings module simply wasn't being read, which turned out to be an import problem. To fix it:

Set the environment variable SCRAPY_SETTINGS_MODULE in your Django project's settings module, myproject/settings.py:

import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myapp.crawler.crawler.settings'

In your Scrapy settings module, crawler/settings.py, add your Scrapy project path to sys.path so that relative imports in the settings file work:

import sys
sys.path.append('/absolute/path/to/scrapy/project')

Change the paths to suit your case.

  • Were you using Django Dynamic Scraper or just the normal Scrapy? I'm essentially doing something similar to what you did but I need to use all of the files from my scrapy project within the django project that I've made. – loremIpsum1771 Jul 28 '15 at 22:01