39

I have a Django site where a scrape happens when a user requests it, and my code kicks off a Scrapy spider as a standalone script in a new process. Naturally, this doesn't scale as the number of users grows.

Something like this:

from scrapy.crawler import CrawlerProcess
from scrapy.conf import settings

class StandAloneSpider(Spider):
    #a regular spider

settings.overrides['LOG_ENABLED'] = True
#more settings can be changed...

crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()

spider = StandAloneSpider()

crawler.crawl(spider)
crawler.start()

I've decided to use Celery and use workers to queue up the crawl requests.

However, I'm running into issues with the Twisted reactor not being able to restart. The first and second spiders run successfully, but subsequent spiders throw the ReactorNotRestartable error.

Can anyone share any tips on running spiders within the Celery framework?

stryderjzw

2 Answers

38

Okay, here is how I got Scrapy working with my Django project that uses Celery for queuing up what to crawl. The actual workaround came primarily from joehillen's code, located here: http://snippets.scrapy.org/snippets/13/

First, the tasks.py file:

from celery import task

@task()
def crawl_domain(domain_pk):
    # import inside the task so Scrapy is only loaded when the task actually runs
    from crawl import domain_crawl
    return domain_crawl(domain_pk)
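
For context, the Django side just queues this task asynchronously. Here is a minimal sketch of a hypothetical call site (the request_crawl function is made up for illustration, not part of the original setup):

from tasks import crawl_domain

def request_crawl(domain):
    # hypothetical: enqueue the crawl and return immediately;
    # a Celery worker picks the task up and runs the spider
    crawl_domain.delay(domain.pk)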

Then, the crawl.py file:

from multiprocessing import Process
from scrapy.crawler import CrawlerProcess
from scrapy.conf import settings
from spider import DomainSpider
from models import Domain

class DomainCrawlerScript():

    def __init__(self):
        # set up the Scrapy crawler once, when the worker loads this module
        self.crawler = CrawlerProcess(settings)
        self.crawler.install()
        self.crawler.configure()

    def _crawl(self, domain_pk):
        # look up the Domain and collect the URLs of its pages
        domain = Domain.objects.get(pk=domain_pk)
        urls = []
        for page in domain.pages.all():
            urls.append(page.url())
        # run the spider over those URLs, then shut the crawler down
        self.crawler.crawl(DomainSpider(urls))
        self.crawler.start()
        self.crawler.stop()

    def crawl(self, domain_pk):
        # run each crawl in its own process so the Twisted reactor
        # never has to be restarted inside the worker process itself
        p = Process(target=self._crawl, args=[domain_pk])
        p.start()
        p.join()

crawler = DomainCrawlerScript()

def domain_crawl(domain_pk):
    # entry point called by the Celery task in tasks.py
    crawler.crawl(domain_pk)

The trick here is the "from multiprocessing import Process"; this gets around the "ReactorNotRestartable" issue in the Twisted framework, because each crawl runs in a fresh process with a fresh reactor. So basically, the Celery task calls the "domain_crawl" function, which reuses the "DomainCrawlerScript" object over and over to interface with your Scrapy spider. (I am aware that my example is a little redundant, but I did do this for a reason in my setup with multiple versions of Python: my Django web server actually uses Python 2.4 and my worker servers use Python 2.7.)

In my example here, "DomainSpider" is just a modified Scrapy spider that takes in a list of URLs and sets them as its "start_urls".
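
A minimal sketch of what such a spider might look like, based on the description above and the old-style BaseSpider API of that Scrapy era (the name and parse method here are placeholders, not the original code):

from scrapy.spider import BaseSpider

class DomainSpider(BaseSpider):
    # hypothetical sketch of the spider described above
    name = "domain_spider"

    def __init__(self, urls):
        # take the list of page URLs and use them as the crawl's start_urls
        super(DomainSpider, self).__init__()
        self.start_urls = urls

    def parse(self, response):
        # placeholder: real item extraction / pipeline logic goes here
        pass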

Hope this helps!

byoungb
  • Do you use PostgreSQL? I'm getting the strangest "InterfaceError: connection already closed" from Celery. – stryderjzw Jul 29 '12 at 22:01
  • Yeah, I am using PostgreSQL. If you want to provide the error I can try to help you figure it out; I did have an error that sounded similar, but I do not recall exactly what it was or what I did. (I am currently loading 11,077,910 items into my queue and I have 5 worker machines pulling from it, so this setup does work.) – byoungb Jul 31 '12 at 13:55
  • I found the problem, and it has to do with storing results in Postgres. I think it's the following ticket: https://github.com/celery/django-celery/issues/121 The workaround I'm using is to set CELERY_RESULT_BACKEND. Have you encountered this? – stryderjzw Aug 02 '12 at 06:33
  • No, I do not have this issue. My current environment is Python 2.6.6 on the server and Python 2.7 on the workers, with celery 3.0.3 and django-celery 3.0.1. – byoungb Aug 02 '12 at 19:56
  • I'm assuming the process will stop after it finishes, right? No need to stop it manually? Really elegant approach; I will probably implement this later as well, since I'm running on Heroku, where I can use a task queue. – Sam Stoelinga Oct 04 '12 at 05:01
  • Dude, I followed your example and got a bad file descriptor exception from the process. Any idea where this came from? – goh Apr 02 '13 at 13:32
  • @goh I need more details; it has been a long time since I had to touch that system, and it has been running quietly for almost a year now. I do not remember having anything like that before. – byoungb Apr 03 '13 at 19:04
  • @byoungb, my worker is running on Python 2.7, using Scrapy 0.16.2 and Twisted 12.2.0 on an OS X machine. I've tried exactly what you wrote above, but got the following exception: – goh Apr 04 '13 at 15:54
  • [2013-04-04 23:35:38,838: WARNING/PoolWorker-1] Traceback (most recent call last): [2013-04-04 23:35:38,838: WARNING/PoolWorker-1] File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 249, in _bootstrap [2013-04-04 23:35:38,838: WARNING/PoolWorker-1] sys.stdin.close() [2013-04-04 23:35:38,838: WARNING/PoolWorker-1] IOError: [Errno 9] Bad file descriptor – goh Apr 04 '13 at 15:56
  • Yeah, sorry @goh, I think you will have to ask someone who knows more about Twisted than I do. I sort of recall a similar error now, but have no idea what I did to fix it. – byoungb Apr 04 '13 at 18:23
  • No worries, I don't know if this is more of a Twisted problem or a multiprocessing problem. – goh Apr 05 '13 at 03:10
  • Could you point me to how you are accessing the info which Scrapy prints to stdout? And how did you organize logging? Thanks. – Eugene Nagorny Jun 03 '13 at 15:08
  • @Ideviantik I actually do not use anything that Scrapy prints to stdout; instead I just wrote my own "pipelines" that collect and record crawl data. For logging I just used "django-sentry" to log errors, which is very handy given the distributed nature of a crawler running multiple threads/processes on multiple machines. – byoungb Jun 03 '13 at 17:58
  • This answer is obsolete as of the latest Celery build. You will run into another issue where you can't spawn a subprocess within the Celery worker process. – Bj Blazkowicz Mar 05 '14 at 14:00
  • Yeah, you are probably correct, and the newest version of Celery is so awesome, so when I rewrite this project to use the newest version I will let you know how to get around that issue as well. – byoungb Mar 05 '14 at 18:07
  • Could you help me here please? http://stackoverflow.com/questions/25353650/scrapy-how-to-import-the-settings-to-override-it – Marco Dinatsoli Aug 17 '14 at 21:17
  • @byoungb I have the same problem. Does this solution still work with Scrapy 1.0? The link you posted is broken now. – loremIpsum1771 Jul 20 '15 at 21:12
14

I set CELERYD_MAX_TASKS_PER_CHILD to 1 in the settings file and that took care of the issue. The worker daemon starts a new process after each spider run and that takes care of the reactor.
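
For reference, a minimal sketch of how this might look in a Celery configuration module; the exact setting name depends on your Celery version (see the comments below), and where you put it depends on how your project is laid out:

# celeryconfig.py (older Celery versions use the uppercase setting name)
CELERYD_MAX_TASKS_PER_CHILD = 1

# newer Celery (4.x+) uses the lowercase name on the app config instead:
# app.conf.worker_max_tasks_per_child = 1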

Mondongo
  • That's a clever idea; I'm not sure how much overhead there is with Celery restarting a task all the time, but it makes sense that this works. So yeah, future users may want to try this. – byoungb Nov 04 '13 at 20:32
  • The setting has been renamed in recent Celery versions. You can also apply it from the command line using this flag: `--max-tasks-per-child 1` – Carlos Peña Dec 29 '16 at 17:59
  • The setting name for the current Celery version (4.3.0) is `WORKER_MAX_TASKS_PER_CHILD`. See the docs: https://docs.celeryproject.org/en/latest/userguide/configuration.html#worker-max-tasks-per-child – Caumons Sep 02 '19 at 14:45