
I'm deploying my Scrapy spider from my local machine to Zyte Cloud (formerly Scrapinghub). The deploy succeeds, but when I run the spider I get the output below.

I already checked here. The Zyte team doesn't seem very responsive on their own site, but I've found developers to be more active here in general :)

My scrapinghub.yml looks like this:

projects:
  default: <myid>
requirements:
  file: requirements.txt

I tried adding these lines to requirements.txt; no matter which one I use, the same error with the same output is generated:

  • git+git://github.com/scrapedia/scrapy-useragents
  • git+git://github.com/scrapedia/scrapy-useragents.git
  • git+https://github.com/scrapedia/scrapy-useragents.git
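gutsytechster suggests in the comments below referencing the package by its PyPI name instead of a git URL. A minimal sketch of requirements.txt would then be (assuming the Scrapedia package is published on PyPI under this name):

# requirements.txt (sketch)
# note: the similarly named scrapy-user-agents on PyPI appears to be a
# different project, which provides the scrapy_user_agents module instead
scrapy-useragents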

What am I doing wrong? BTW: the spider works fine when I run it on my local machine.

 File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py", line 177, in crawl

   return self._crawl(crawler, *args, **kwargs)

    File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py", line 181, in _crawl

   d = crawler.crawl(*args, **kwargs)

    File "/usr/local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator

   return _cancellableInlineCallbacks(gen)

    File "/usr/local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks

   _inlineCallbacks(None, g, status)

  --- <exception caught here> ---

    File "/usr/local/lib/python3.8/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks

   result = g.send(result)

    File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py", line 89, in crawl

   self.engine = self._create_engine()

    File "/usr/local/lib/python3.8/site-packages/scrapy/crawler.py", line 103, in _create_engine

   return ExecutionEngine(self, lambda _: self.stop())

    File "/usr/local/lib/python3.8/site-packages/scrapy/core/engine.py", line 69, in __init__

   self.downloader = downloader_cls(crawler)

    File "/usr/local/lib/python3.8/site-packages/scrapy/core/downloader/__init__.py", line 83, in __init__

   self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)

    File "/usr/local/lib/python3.8/site-packages/scrapy/middleware.py", line 53, in from_crawler

   return cls.from_settings(crawler.settings, crawler)

    File "/usr/local/lib/python3.8/site-packages/scrapy/middleware.py", line 34, in from_settings

   mwcls = load_object(clspath)

    File "/usr/local/lib/python3.8/site-packages/scrapy/utils/misc.py", line 50, in load_object

   mod = import_module(module)

    File "/usr/local/lib/python3.8/importlib/__init__.py", line 127, in import_module

   return _bootstrap._gcd_import(name[level:], package, level)

    File "<frozen importlib._bootstrap>", line 1014, in _gcd_import

    File "<frozen importlib._bootstrap>", line 991, in _find_and_load

    File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked

    File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed

    File "<frozen importlib._bootstrap>", line 1014, in _gcd_import

    File "<frozen importlib._bootstrap>", line 991, in _find_and_load

    File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked

  builtins.ModuleNotFoundError: No module named 'scrapy_user_agents'

UPDATE 1

I tried @Thiago Curvelo's suggestion below, and something weird is happening.

This code worked for me when running the spider locally:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

I then changed it to scrapy_useragents as per your suggestion:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
}

Now I get an error when running it locally:

ModuleNotFoundError: No module named 'scrapy_useragents'

However, I also deployed to Zyte with `shub deploy <myid>`, and when running there I now get a different error:

Connection was refused by other side: 111: Connection refused.

I'm confused about what is happening here.
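As Thiago later points out in the comments, the error apparently comes from scrapy-splash trying to reach a Splash instance that exists on my machine but not on Zyte Cloud. A minimal sketch of the setting involved (the URLs are placeholders, not my actual values):

# settings.py (sketch)
SPLASH_URL = 'http://localhost:8050'  # fine locally, where Splash is running
# ...but nothing listens on localhost:8050 inside a Zyte Cloud container;
# a reachable hosted Splash instance would be needed there, e.g.:
# SPLASH_URL = 'http://<your-splash-host>:8050'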

My log (CSV download):

time,level,message
01-10-2021 08:57,INFO,Log opened.
01-10-2021 08:57,INFO,[scrapy.utils.log] Scrapy 2.0.0 started (bot: foobar)
01-10-2021 08:57,INFO,"[scrapy.utils.log] Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.8.2 (default, Feb 26 2020, 15:09:34) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Linux-4.15.0-72-generic-x86_64-with-glibc2.2.5"
01-10-2021 08:57,INFO,"[scrapy.crawler] Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'BOT_NAME': 'foobar',
 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
 'LOG_ENABLED': False,
 'LOG_LEVEL': 'INFO',
 'MEMUSAGE_LIMIT_MB': 950,
 'NEWSPIDER_MODULE': 'foobar.spiders',
 'SPIDER_MODULES': ['foobar.spiders'],
 'STATS_CLASS': 'sh_scrapy.stats.HubStorageStatsCollector',
 'TELNETCONSOLE_HOST': '0.0.0.0'}"
01-10-2021 08:57,INFO,[scrapy.extensions.telnet] Telnet Password: <password>
01-10-2021 08:57,INFO,"[scrapy.middleware] Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.spiderstate.SpiderState',
 'scrapy.extensions.throttle.AutoThrottle',
 'scrapy.extensions.debug.StackTraceDump',
 'sh_scrapy.extension.HubstorageExtension']"
01-10-2021 08:57,INFO,"[scrapy.middleware] Enabled downloader middlewares:
['sh_scrapy.diskquota.DiskQuotaDownloaderMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'sh_scrapy.middlewares.HubstorageDownloaderMiddleware']"
01-10-2021 08:57,INFO,"[scrapy.middleware] Enabled spider middlewares:
['sh_scrapy.diskquota.DiskQuotaSpiderMiddleware',
 'sh_scrapy.middlewares.HubstorageSpiderMiddleware',
 'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']"
01-10-2021 08:57,INFO,"[scrapy.middleware] Enabled item pipelines:
[]"
01-10-2021 08:57,INFO,[scrapy.core.engine] Spider opened
01-10-2021 08:57,INFO,"[scrapy.extensions.logstats] Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)"
01-10-2021 08:57,INFO,[scrapy_useragents.downloadermiddlewares.useragents] Load 0 user_agents from settings.
01-10-2021 08:57,INFO,TelnetConsole starting on 6023
01-10-2021 08:57,INFO,[scrapy.extensions.telnet] Telnet console listening on 0.0.0.0:6023
01-10-2021 08:57,WARNING,"[py.warnings] /usr/local/lib/python3.8/site-packages/scrapy_splash/request.py:41: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
  url = to_native_str(url)
"
01-10-2021 08:57,ERROR,[scrapy.downloadermiddlewares.retry] Gave up retrying <GET https://www.example.com/allobjects via http://localhost:8050/execute> (failed 3 times): Connection was refused by other side: 111: Connection refused.
01-10-2021 08:57,ERROR,"[scrapy.core.scraper] Error downloading <GET https://www.example.com/allobjects via http://localhost:8050/execute>
Traceback (most recent call last):
  File ""/usr/local/lib/python3.8/site-packages/scrapy/core/downloader/middleware.py"", line 42, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
twisted.internet.error.ConnectionRefusedError: Connection was refused by other side: 111: Connection refused."
01-10-2021 08:57,INFO,[scrapy.core.engine] Closing spider (finished)
01-10-2021 08:57,INFO,"[scrapy.statscollectors] Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 3,
 'downloader/request_bytes': 3813,
 'downloader/request_count': 3,
 'downloader/request_method_count/POST': 3,
 'elapsed_time_seconds': 12.989914,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 10, 1, 8, 57, 26, 273397),
 'log_count/ERROR': 2,
 'log_count/INFO': 11,
 'log_count/WARNING': 1,
 'memusage/max': 62865408,
 'memusage/startup': 62865408,
 'retry/count': 2,
 'retry/max_reached': 1,
 'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 2,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/disk': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/disk': 4,
 'splash/execute/request_count': 1,
 'start_time': datetime.datetime(2021, 10, 1, 8, 57, 13, 283483)}"
01-10-2021 08:57,INFO,[scrapy.core.engine] Spider closed (finished)
01-10-2021 08:57,INFO,Main loop terminated.
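One more thing I notice in the log: "Load 0 user_agents from settings." suggests the middleware found no user agents to rotate. A hypothetical sketch of the setting it appears to read (the entries and the exact format are assumptions; verify against the scrapy-useragents README):

# settings.py (hypothetical sketch; check the package README for the exact format)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15',
]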
  • Can't you install it via `pip` as `pip install scrapy-useragents`? – gutsytechster Sep 30 '21 at 08:24
  • Thanks, but I don't see any documentation on how to do that on Zyte Cloud, do you? All I see is the approach I describe via requirements.txt – Adam Sep 30 '21 at 08:35
  • Did you not deploy the project via `shub deploy`? This command picks the configuration from the `scrapinghub.yml` file. – gutsytechster Sep 30 '21 at 10:11
  • That's exactly what I did, the deploy is successful via `shub deploy`, but when I run the spider it fails with the above errors – Adam Sep 30 '21 at 10:31
  • Can you try mentioning the requirement as a package, and not a git url in the requirements file, and then deploy it again to see if it works? You can just mention `scrapy-useragents` in the file, and the dependencies will be installed again. – gutsytechster Sep 30 '21 at 11:46
  • Sorry am new to this, so "mentioning the requirement as a package, and not a git url in the requirements file" what exactly do you mean and how do I do that? – Adam Sep 30 '21 at 12:34
  • 1
    Regarding the update, It is looking for a Splash instance which isn't running, [just like your other question](https://stackoverflow.com/questions/69198205/connection-was-refused-by-other-side-10061-no-connection-could-be-made-because?noredirect=1#comment122568308_69198205) – Thiago Curvelo Oct 01 '21 at 15:47
  • Thanks so much again Thiago, you've been truly helpful :). It's unfortunate I have to rely on you just because Zyte support is non-responsive. When checking on Zyte it seems Splash is a separate paid add-on without a trial option, so I'll have to incur costs to even test it. Did you happen to see comment on my other question? Might that work? https://stackoverflow.com/questions/69198205/connection-was-refused-by-other-side-10061-no-connection-could-be-made-because#comment122675679_69198205 – Adam Oct 02 '21 at 11:43

1 Answer


It seems you have a typo in your middleware settings. Scrapy is looking for a module called `scrapy_user_agents`, but the correct name is `scrapy_useragents`.

Double-check the content of DOWNLOADER_MIDDLEWARES in settings.py. It should look like this:

DOWNLOADER_MIDDLEWARES = {
    # ...
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
}
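
If in doubt about which of the two similarly named packages is installed, a quick throwaway check (not part of your project) can tell them apart:

import importlib.util

# See which of the two confusingly similar modules the current
# environment can actually import.
for name in ('scrapy_user_agents', 'scrapy_useragents'):
    found = importlib.util.find_spec(name) is not None
    print(name, '->', 'installed' if found else 'missing')

Run it in the same environment where the spider runs to confirm what is actually available there.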
Thiago Curvelo
  • Thank you. Added update 1 as per your suggestion, could you have another look please? – Adam Oct 01 '21 at 09:08