I'm using Scrapy to scrape the adidas site: http://www.adidas.com/us/men-shoes.
But it always fails with this error:
User timeout caused connection failure: Getting http://www.adidas.com/us/men-shoes took longer than 180.0 seconds..
It retries 5 times and then fails completely.
I can access the URL in Chrome, but it doesn't work in Scrapy.
I've tried using custom user agents and emulating the browser's request headers, but it still doesn't work.
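The 180.0 seconds in the error message matches the default value of Scrapy's DOWNLOAD_TIMEOUT setting. While debugging, a shorter timeout and fewer retries make each failed attempt much quicker to observe. A minimal sketch of project settings, where the specific values are arbitrary assumptions chosen only for faster feedback:

```python
# settings.py fragment (the same keys also work in a spider's custom_settings dict).
# The values below are illustrative assumptions, not recommendations.

# Give up on a request after 30 seconds instead of the default 180.
DOWNLOAD_TIMEOUT = 30

# Retry a failed request at most once (in addition to the first attempt).
RETRY_TIMES = 1

# Keep debug logging on so each request and retry shows up in the log.
LOG_LEVEL = "DEBUG"
```

This does not address the missing response itself; it only shortens the wait between experiments.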
Below is my code:
    import scrapy


    class AdidasSpider(scrapy.Spider):
        name = "adidas"

        def start_requests(self):
            urls = ['http://www.adidas.com/us/men-shoes']
            # Headers copied from a Chrome request to the same URL.
            headers = {
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
                "Accept-Encoding": "gzip, deflate",
                "Accept-Language": "en-US,en;q=0.9",
                "Cache-Control": "max-age=0",
                "Connection": "keep-alive",
                "Host": "www.adidas.com",
                "Upgrade-Insecure-Requests": "1",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
            }
            for url in urls:
                yield scrapy.Request(url, self.parse, headers=headers)

        def parse(self, response):
            # A spider callback must yield items or requests, not raw bytes,
            # so the body is wrapped in a dict.
            yield {"body": response.body}
Scrapy log:
    {'downloader/exception_count': 1,
     'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
     'downloader/request_bytes': 224,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'finish_reason': 'shutdown',
     'finish_time': datetime.datetime(2018, 1, 25, 10, 59, 35, 57000),
     'log_count/DEBUG': 2,
     'log_count/INFO': 9,
     'retry/count': 1,
     'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2018, 1, 25, 10, 58, 39, 550000)}
Update
After inspecting the request headers with Fiddler and running some tests, I found what is causing the issue. Scrapy sends a Connection: close header by default, and because of that I get no response from the adidas site.
When I replayed the same request in Fiddler without the Connection: close header, I got the correct response. So the question now is: how do I remove the Connection: close header?
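The Fiddler comparison can also be reproduced outside Scrapy, which helps confirm that the header alone makes the difference. The following is a minimal sketch using only the Python standard library; the helper names build_request and first_response_line, the reduced header set, and the 15-second timeout are assumptions made for illustration and are not part of Scrapy.

```python
import socket


def build_request(host, path, send_close):
    """Build a raw HTTP/1.1 GET request, optionally with 'Connection: close'."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "User-Agent: Mozilla/5.0",
        "Accept: */*",
    ]
    if send_close:
        lines.append("Connection: close")
    # A blank line terminates the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def first_response_line(host, path, send_close, timeout=15.0):
    """Send the request over plain HTTP (port 80).

    Returns the status line, an empty string if the server closed the
    connection without sending data, or None if nothing arrived in time.
    """
    try:
        with socket.create_connection((host, 80), timeout=timeout) as sock:
            sock.sendall(build_request(host, path, send_close))
            data = sock.recv(4096)
    except socket.timeout:
        return None
    return data.split(b"\r\n", 1)[0].decode("latin-1") if data else ""


# Example usage (performs real network requests, so it is left commented out):
# for send_close in (True, False):
#     print(send_close, first_response_line("www.adidas.com", "/us/men-shoes", send_close))
```

If the variant with Connection: close times out while the other returns a status line, that would match what Fiddler showed; the site's behavior may of course have changed since then.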