I have a spider that yields all results as expected when run locally, but when I run it on AWS through Jenkins it cannot connect at all.
Things I've tried:

- Removing the `Connection: close` header, as suggested in [Scrapy error: User timeout caused connection failure](https://techmonger.github.io/65/troubleshoot-scrapy-user-timeout/).
- Delaying and limiting concurrent requests through `custom_settings`.
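For the second point, the throttling settings I added were along these lines (the exact values varied between attempts; these particular numbers are illustrative):

```python
# Illustrative throttle settings merged into the spider's custom_settings.
# The exact values here are placeholders, not the ones from my runs.
THROTTLE_SETTINGS = {
    'DOWNLOAD_DELAY': 2,                  # seconds to wait between requests
    'CONCURRENT_REQUESTS': 1,             # overall concurrency cap
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # per-domain concurrency cap
    'DOWNLOAD_TIMEOUT': 60,               # raise the user-timeout threshold
}
```

None of these combinations changed the outcome on AWS.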
This is the structure of the code, without my attempts to solve the issue:

```python
import json

from scrapy import Request, Spider
from scrapy.http import HtmlResponse


class MySpider(Spider):
    name = 'spider_name'
    RETRY_ENABLED = False

    custom_settings = {
        'ITEM_PIPELINES': {...}
    }

    def start_requests(self):
        for i in ...:
            url = '...'.format(...)
            yield Request(url)

    def parse(self, response: HtmlResponse):
        # Scrapy responses expose .text (str) and .body (bytes), not .content
        data = json.loads(response.text)
        for record in data:
            item = ...
            yield item
```
Logs from Jenkins' console output:

```
INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-05 09:10:41 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
DEBUG:scrapy.downloadermiddlewares.retry:Retrying http://.../api/v1/...?&a=1&b=2(3)&c=4(5)> (failed 5 times): User timeout caused connection failure.
```
*Note: other spiders work fine under other domains of the same website.*