
I am running Scrapy spiders inside Celery and I am getting this kind of error randomly:

Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/twisted/python/log.py", line 103, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/usr/lib/python2.7/site-packages/twisted/python/log.py", line 86, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/usr/lib/python2.7/site-packages/twisted/python/context.py", line 122, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/site-packages/twisted/python/context.py", line 85, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/usr/lib/python2.7/site-packages/twisted/internet/posixbase.py", line 602, in _doReadOrWrite
    why = selectable.doWrite()
exceptions.AttributeError: '_SIGCHLDWaker' object has no attribute 'doWrite'

I am using:

celery==3.1.19
Django==1.9.4
Scrapy==1.3.0

This is how I run Scrapy inside Celery:

from billiard import Process
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class MyCrawlerScript(Process):
    def __init__(self, **kwargs):
        Process.__init__(self)
        # get_project_settings() takes no arguments; it locates the project
        # via the SCRAPY_SETTINGS_MODULE environment variable.
        settings = get_project_settings()
        self.crawler = CrawlerProcess(settings)
        # Pop the spider name so it is not also forwarded to the spider below.
        self.spider_name = kwargs.pop('spider_name')
        self.kwargs = kwargs

    def run(self):
        # Forward the remaining keyword arguments to the spider.
        self.crawler.crawl(self.spider_name, **self.kwargs)
        self.crawler.start()

def my_crawl_manager(**kwargs):
    # Run each crawl in its own billiard child process so the Twisted
    # reactor starts and stops outside the Celery worker process itself.
    crawler = MyCrawlerScript(**kwargs)
    crawler.start()
    crawler.join()

Inside a Celery task, I am calling:

my_crawl_manager(spider_name='my_spider', url='www.google.com/any-url-here')
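
Roughly, that task is just a thin wrapper around the call above (the task name here is illustrative, and the `@shared_task` decorator is an assumption about the setup, not code copied from my project):

from celery import shared_task

@shared_task
def run_spider(spider_name, url):
    # Delegate to the billiard-based manager above, so the Twisted
    # reactor runs in a child process of the worker, not in the worker.
    my_crawl_manager(spider_name=spider_name, url=url)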

Any idea why this is happening?

P.S.: I have asked another question, Why I am Getting KeyError in Scrapy?, but I don't know if the two issues are related.

  • Why indeed! I would love an http://sscce.org/ demonstrating this behavior. – Jean-Paul Calderone Feb 21 '17 at 23:32
  • Did you find a solution? We have the exact same problem but no solution so far... – cp2587 Apr 03 '17 at 10:57
  • I have similar issues; I described my settings [here](https://github.com/scrapy/scrapy/issues/2499). – dtrckd Dec 05 '17 at 18:25
  • Still happens on `Twisted==18.9.0`, `Scrapy==1.5.1` – Ami Hollander Jan 03 '19 at 06:37
  • Python 2 most certainly had issues restarting system calls on signals. Python 3 is supposed to be a big improvement; however, I see a claim in the GitHub ticket that the problem is still present. IMO, this needs to be reproduced with Python 3.7 and the latest Twisted, then without Scrapy (which is incidental). Ultimately, Twisted + multiprocessing is a bad mix. – Dima Tisnek Jan 09 '19 at 01:31
  • See https://stackoverflow.com/a/54440446/939364 – Gallaecio Jan 31 '19 at 09:56

1 Answer


I had the same issue. I'm working within a complex application, using asyncio, multiprocessing, Twisted and Scrapy all together.

The solution for me was to use asyncioreactor, installing the alternate reactor before any Scrapy imports:

from twisted.internet import asyncioreactor
asyncioreactor.install()

# Import Scrapy only after the reactor has been installed; otherwise the
# default reactor may already be in place.
from scrapy.crawler import CrawlerProcess
# ...
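
For illustration, here is a minimal sketch of how this could be combined with a crawl entry point like the one in the question (the `crawl()` helper is an assumption for the example, not part of the original answer):

# Sketch under the assumptions above: the asyncio reactor is installed
# before Scrapy is imported, then the crawl runs as usual.
from twisted.internet import asyncioreactor
asyncioreactor.install()

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def crawl(spider_name, **kwargs):
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_name, **kwargs)
    process.start()  # blocks until the crawl finishes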
  • `File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/asyncio/selector_events.py", line 267, in _add_reader (handle, None)) File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/selectors.py", line 541, in register self._kqueue.control([kev], 0, 0) OSError: [Errno 9] Bad file descriptor` – Ami Hollander Jan 03 '19 at 07:30
  • `future.result() File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/_base.py", line 432, in result return self.__get_result() File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result raise self._exception OSError: [Errno 9] Bad file descriptor` – Ami Hollander Jan 03 '19 at 07:30
  • I couldn't paste the whole stack trace, but I'll add that I am also using a `ProcessPoolExecutor` to run those spiders – Ami Hollander Jan 03 '19 at 07:31