I created a spider that scrapes data from Yelp using Scrapy. All requests go through the Crawlera proxy. The spider takes a URL to scrape, sends a request, and extracts the data. This worked fine until a few days ago, when I started getting a 502 None response. The 502 None response appears after this line executes:
r = self.req_session.get(url, proxies=self.proxies, verify='../secret/crawlera-ca.crt').text
The relevant log output (this is urllib3 debug logging, not an exception traceback):
2020-11-04 14:27:55 [urllib3.connectionpool] DEBUG: https://www.yelp.com:443 "GET /biz/a-dog-in-motion-arcadia HTTP/1.1" 502 None
So it seems the spider cannot reach the URL because the connection is being closed somewhere along the way.
I have checked what 502 means in the Scrapy and Crawlera documentation: it refers to the connection being refused or closed, the domain being unavailable, and similar conditions. I have debugged the code around where the issue occurs, and everything is up to date.
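For reference, this is roughly how the session is set up, plus a retry wrapper I tried as a workaround so a single 502 does not kill the request immediately. This is a sketch: `make_session`, the proxy URL, and the retry counts are placeholders, not my exact production values.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(proxies, ca_cert, retries=3):
    # Hypothetical helper: build a requests.Session that retries
    # 502/503/504 responses with exponential backoff instead of
    # returning the first bad response.
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=1,                 # sleep 0s, 2s, 4s between attempts
        status_forcelist=[502, 503, 504],  # retry only on these statuses
    )
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.proxies.update(proxies)   # e.g. the Crawlera proxy endpoint
    session.verify = ca_cert          # Crawlera CA certificate path
    return session
```

Usage would then look like `r = make_session(proxies, '../secret/crawlera-ca.crt').get(url)`, followed by `r.raise_for_status()` to surface a persistent 502 as an exception. Even with this in place, the 502s keep coming back, which makes me think the problem is on the proxy or target side rather than a transient network blip.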
If anyone has ideas or experience with this, I would love to hear them, as I am stuck. What could actually be the issue here?
NOTE: The Yelp URLs load normally when I open them in a browser.