I have the following setup (Docker):
- Celery, linked to the Flask setup, which runs the Scrapy spider
- Flask setup (obviously)
- The Flask setup gets a request for Scrapy -> fires up a worker to do some work (roughly the wiring sketched below)
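For context, a minimal sketch of what I mean by that wiring; the broker URL, module names and `MySpider` are placeholders rather than my actual code:

```python
# Rough shape of the setup; broker URL, module names and `MySpider` are
# placeholders, not the actual project code.
from celery import Celery
from flask import Flask, jsonify
from scrapy.crawler import CrawlerProcess

from myproject.spiders import MySpider  # hypothetical spider module

flask_app = Flask(__name__)
celery_app = Celery('tasks', broker='redis://redis:6379/0',
                    backend='redis://redis:6379/0')


@celery_app.task
def run_spider(start_url):
    # Runs in the Celery worker container; blocks until the crawl finishes.
    process = CrawlerProcess(settings={'LOG_ENABLED': False})
    process.crawl(MySpider, start_url=start_url)
    process.start()


@flask_app.route('/scrape/<path:url>')
def scrape(url):
    # Flask only dispatches the task and hands back an id to poll later.
    task = run_spider.delay(url)
    return jsonify({'task_id': task.id})
```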
Now I wish to update the original Flask setup on the progress of the Celery worker, BUT there is currently no way to call celery.update_state()
inside of the scraper, as it has no access to the original task (even though it is being run inside of the Celery task).
As an aside: am I missing something about the structure of Scrapy? It would seem reasonable that I could assign arguments inside of __init__
to use further on, but Scrapy seems to treat the method like a lambda function.. (a sketch of passing arguments through __init__ follows below).
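For what it's worth, Scrapy does forward extra keyword arguments from `crawl()` (or `-a name=value` on the command line) into the spider's `__init__`, so they can be stored on the instance. A minimal sketch against the Scrapy demo site; the `progress_callback` argument is purely illustrative:

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def __init__(self, progress_callback=None, *args, **kwargs):
        # Keyword arguments given to crawl() (or -a name=value) arrive here,
        # so they can be stored on the instance and used later in parse().
        super().__init__(*args, **kwargs)
        self.progress_callback = progress_callback
        self.scraped_count = 0

    def parse(self, response):
        for quote in response.css('div.quote'):
            self.scraped_count += 1
            if self.progress_callback:
                self.progress_callback(self.scraped_count)
            yield {'text': quote.css('span.text::text').get()}


process = CrawlerProcess(settings={'LOG_ENABLED': False})
# Extra keyword arguments to crawl() are passed through to __init__.
process.crawl(QuotesSpider, progress_callback=print)
process.start()
```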
To answer some questions:
How are you using Celery with Scrapy?
Scrapy is running inside of a Celery task, not run from the command line. I have also never heard of scrapyd; is this a subproject of Scrapy? I use a remote worker to fire off Scrapy from inside of a Celery/Flask instance, so it is not the same as the thread instanced by the original request; they are separate Docker instances.
The task.update_state call works great inside of the Celery task, but as soon as we are 'in' the spider, we no longer have access to Celery. Any ideas? (See the sketch below for where that access gets lost.)
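To show where the access gets lost, a sketch of the current task; `QuotesSpider` and its `progress_callback` argument refer to the illustrative spider above, not anything built into Scrapy:

```python
from celery import Celery
from scrapy.crawler import CrawlerProcess

from myproject.spiders import QuotesSpider  # hypothetical spider module

celery_app = Celery('tasks', broker='redis://redis:6379/0',
                    backend='redis://redis:6379/0')


@celery_app.task(bind=True)
def run_spider(self):
    # Here `self` is the task instance, so update_state is available...
    self.update_state(state='PROGRESS', meta={'status': 'starting crawl'})

    process = CrawlerProcess(settings={'LOG_ENABLED': False})
    # ...but once Scrapy takes over, the spider has no reference to `self`
    # unless it is handed in explicitly, e.g. as a spider argument.
    process.crawl(
        QuotesSpider,
        progress_callback=lambda n: self.update_state(
            state='PROGRESS', meta={'items_scraped': n}),
    )
    process.start()
```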
From the item_scraped signal, issue Task.update_state(task_id, meta={}). You can also run it without the task_id if Scrapy happens to be running in a Celery task itself (as it defaults to self).
Is this sort of like a static way of accessing the current Celery task? Because I would love that....
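For reference, Celery does expose a `celery.current_task` proxy that resolves to whatever task the current worker process is executing, so the spider can reach the task without an explicit reference being passed in, as long as the crawl really does run inside the task. A sketch combining it with the item_scraped signal; the spider name and demo site are again placeholders:

```python
import scrapy
from scrapy import signals
from celery import current_task


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.items_scraped = 0

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Report progress every time an item makes it through the pipeline.
        crawler.signals.connect(spider.item_scraped,
                                signal=signals.item_scraped)
        return spider

    def item_scraped(self, item, response, spider):
        self.items_scraped += 1
        # current_task resolves to the Celery task running in this worker;
        # it is falsy when the spider runs outside of Celery.
        if current_task:
            current_task.update_state(
                state='PROGRESS', meta={'items_scraped': self.items_scraped})

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').get()}
```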