
I'm using Scrapy to scrape the adidas site http://www.adidas.com/us/men-shoes, but it always shows this error:

User timeout caused connection failure: Getting http://www.adidas.com/us/men-shoes took longer than 180.0 seconds..

It retries 5 times and then fails completely.

I can access the URL in Chrome, but it doesn't work in Scrapy.
I've tried using custom user agents and emulating browser request headers, but it still doesn't work.

Below is my code:

import scrapy


class AdidasSpider(scrapy.Spider):
    name = "adidas"

    def start_requests(self):

        urls = ['http://www.adidas.com/us/men-shoes']

        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "en-US,en;q=0.9",
            "Cache-Control": "max-age=0",
            "Connection": "keep-alive",
            "Host": "www.adidas.com",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
        }

        for url in urls:
            yield scrapy.Request(url, self.parse, headers=headers)

    def parse(self, response):
        # A spider callback must yield requests or items, not raw bytes.
        yield {"body": response.body}

Scrapy log:

{'downloader/exception_count': 1,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
 'downloader/request_bytes': 224,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2018, 1, 25, 10, 59, 35, 57000),
 'log_count/DEBUG': 2,
 'log_count/INFO': 9,
 'retry/count': 1,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2018, 1, 25, 10, 58, 39, 550000)}

Update

After looking at the request headers in Fiddler and doing some tests, I found what was causing the issue: Scrapy sends a `Connection: close` header by default, and because of it I get no response from the adidas site.


After testing in Fiddler by making the same request without the `Connection: close` header, I got the response correctly. Now the problem is: how do I remove the `Connection: close` header?


– Biswajit Chopdar

5 Answers


Since Scrapy doesn't let you edit the `Connection: close` header, I used scrapy-splash instead to make the requests through Splash.

Now the `Connection: close` header can be overridden and everything works. The downside is that the page has to load all its assets before Splash returns the response, so it's slower, but it works.

Scrapy should add an option to edit its default `Connection: close` header; it is hardcoded in the library and cannot be overridden easily.

Below is my working code:

import scrapy
from scrapy_splash import SplashRequest


class AdidasSpider(scrapy.Spider):
    name = "adidas"

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Host": "www.adidas.com",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    }

    def start_requests(self):
        url = "http://www.adidas.com/us/men-shoes?sz=120&start=0"
        yield SplashRequest(url, self.parse, headers=self.headers)
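
For `SplashRequest` to work, a Splash instance has to be running and scrapy-splash wired into settings.py. A sketch based on the scrapy-splash README (the `SPLASH_URL` assumes a local Splash instance):

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
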
– Biswajit Chopdar

Well, at least you should use the headers you wrote by adding `headers=headers` to your `scrapy.Request`. However, it still didn't work even after I tried to yield `scrapy.Request(url, self.parse, headers=headers)`.

So next I set the `USER_AGENT` in settings.py to the one from your headers, i.e. "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36", didn't use the headers you wrote in `scrapy.Request`, and it worked.
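
For reference, that change in settings.py looks like this (`USER_AGENT` is the standard Scrapy setting):

USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/63.0.3239.132 Safari/537.36")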

Maybe there is something wrong in the headers, but I'm pretty sure it's not about cookies.

– just_be_happy

I tried accessing the site using curl, and the connection hangs:

curl -v -L http://www.adidas.com/us/men-shoes

So I jumped into the browser's debugger and noticed there was a `Cookie` header in the request. I copied the entire value from that header and passed it to curl with the `-H` option:

curl -v -L -H 'Cookie:<cookie value here>' http://www.adidas.com/us/men-shoes

Now the HTML content returns. So the site, at some point, sets cookies that are required to access the remainder of the site. Unfortunately, I'm not sure where or how to acquire the cookies programmatically. Let us know if you figure it out. Hope this helps.

Update

Looks like there are ways to use persistent session data (i.e. cookies) in Scrapy (I've never had to use it till this point :)). Take a look at this answer and this doc. I thought maybe the site was redirecting requests to set the cookie, but it's not. So it should be a relatively simple problem to fix.
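
A minimal sketch of seeding a request with browser-captured cookies; the cookie name and value below are placeholders, and Scrapy's built-in cookie middleware persists whatever the site sets afterwards:

import scrapy


class AdidasCookieSpider(scrapy.Spider):
    name = "adidas_cookies"

    def start_requests(self):
        # Placeholder cookie copied from the browser's debugger.
        cookies = {"example_cookie": "value-from-browser"}
        yield scrapy.Request(
            "http://www.adidas.com/us/men-shoes",
            callback=self.parse,
            cookies=cookies,
        )

    def parse(self, response):
        # Yield the page title to confirm the content came through.
        yield {"title": response.css("title::text").get()}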

– notorious.no

Using your code, the first connection works just fine for me: it uses the headers you give and gets the correct response. I modified your parse method to follow the product links and print the content of the <title> tags from the received pages, and that worked fine too. Sample log and printout below. I suspect you're being throttled for making too many requests.
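
The modified parse method was along these lines (a sketch; the link selector is an assumption based on the product URLs in the log):

    def parse(self, response):
        # Follow each product link found on the listing page.
        for href in response.css("a[href*='-shoes/']::attr(href)").getall():
            yield response.follow(href, callback=self.parse_product)

    def parse_product(self, response):
        # Print the page title, as in the printout below.
        print(response.css("title::text").get())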

2018-01-27 16:48:23 [scrapy.core.engine] INFO: Spider opened
2018-01-27 16:48:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-27 16:48:23 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-01-27 16:48:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.adidas.com/us/men-shoes> (referer: None)
2018-01-27 16:48:25 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://www.adidas.com/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2018-01-27 16:48:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.adidas.com/us/alphabounce-beyond-shoes/DB1126.html> from <GET http://www.adidas.com/us/alphabounce-beyond-shoes/DB1126.html>
2018-01-27 16:48:25 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.adidas.com/us/ultraboost-laceless-shoes/BB6137.html> from <GET http://www.adidas.com/us/ultraboost-laceless-shoes/BB6137.html>

<snipped a bunch>

2018-01-27 16:48:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.adidas.com/us/> (referer: http://www.adidas.com/us/men-shoes)
2018-01-27 16:48:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.adidas.com/us/nmd_cs2-primeknit-shoes/BY3012.html> (referer: http://www.adidas.com/us/men-shoes)
adidas Alphabounce Beyond Shoes - White | adidas US
adidas UA&SONS NMD R2 Shoes - Grey | adidas US
adidas NMD_C2 Shoes - Brown | adidas US
adidas NMD_CS2 Primeknit Shoes - Grey | adidas US
adidas NMD_Racer Primeknit Shoes - Black | adidas US
adidas Official Website | adidas US
adidas NMD_CS2 Primeknit Shoes - Black | adidas US
2018-01-27 16:48:26 [scrapy.core.engine] INFO: Closing spider (finished)
2018-01-27 16:48:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
– Nathan Vērzemnieks
  • Sometimes it works out of the blue, but if the `Connection: close` header is not present, it works every time. Try running the spider multiple times to see if consecutive runs work. – Biswajit Chopdar Jan 28 '18 at 08:22
  • I did, at least eight or ten times in the course of testing. Always worked. :shrug: Glad you found a solution, though! – Nathan Vērzemnieks Feb 02 '18 at 22:18

You could use the tool https://curl.trillworks.com/ to

  • get a curl command from Chrome,
  • run the converted Python code (I got response 200 from your URL via Requests), and
  • copy the headers and cookies into your `scrapy.Request` (see the sketch below).
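
The converted code follows a pattern like this (a sketch; the header and cookie values are placeholders that would come from your own captured curl command):

import requests

# Headers and cookies exported from Chrome's "Copy as cURL", then converted;
# the values below are placeholders.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
}
cookies = {"example_cookie": "value-from-browser"}

response = requests.get("http://www.adidas.com/us/men-shoes",
                        headers=headers, cookies=cookies)
print(response.status_code)  # 200 once the site accepts the request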
– northtree