I have a spider that yields all results as expected when run locally, but when I run it on AWS through Jenkins it cannot connect at all.
Things I've tried:

- Removing the `Connection: close` header, as suggested in [Scrapy error: User timeout caused connection failure](https://techmonger.github.io/65/troubleshoot-scrapy-user-timeout/).
- Delaying and limiting concurrent requests through `custom_settings`.
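For the second point, the throttling settings I added were along these lines (the exact values varied between attempts; these particular numbers are illustrative):

```python
# Illustrative throttle settings merged into the spider's custom_settings.
# The exact values here are placeholders, not the ones from my runs.
THROTTLE_SETTINGS = {
    'DOWNLOAD_DELAY': 2,                  # seconds to wait between requests
    'CONCURRENT_REQUESTS': 1,             # overall concurrency cap
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # per-domain concurrency cap
    'DOWNLOAD_TIMEOUT': 60,               # raise the user-timeout threshold
}
```

None of these combinations changed the outcome on AWS.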
This is the structure of the code, without my attempts to solve the issue:

```python
import json

from scrapy import Request, Spider
from scrapy.http import HtmlResponse


class MySpider(Spider):
    name = 'spider_name'
    RETRY_ENABLED = False

    custom_settings = {
        'ITEM_PIPELINES': {...}
    }

    def start_requests(self):
        for i in ...:
            url = '...'.format(...)
            yield Request(url)

    def parse(self, response: HtmlResponse):
        # Scrapy responses expose .text (str) and .body (bytes), not .content
        data = json.loads(response.text)
        for record in data:
            item = ...
            yield item
```
Logs from Jenkins' console output:

```
INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-08-05 09:10:41 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
DEBUG:scrapy.downloadermiddlewares.retry:Retrying http://.../api/v1/...?&a=1&b=2(3)&c=4(5)> (failed 5 times): User timeout caused connection failure.
```
*Note: other spiders work fine under other domains of the same website.*