
I created a spider that scrapes data from Yelp using Scrapy. All requests go through the Crawlera proxy. The spider gets the URL to scrape, sends a request, and scrapes the data. This worked fine until the other day, when I started getting a 502 None response. The 502 None response appears after this line executes:

r = self.req_session.get(url, proxies=self.proxies, verify='../secret/crawlera-ca.crt').text

The debug log output:

2020-11-04 14:27:55 [urllib3.connectionpool] DEBUG: https://www.yelp.com:443 "GET /biz/a-dog-in-motion-arcadia HTTP/1.1" 502 None

So it seems that the spider cannot reach the URL because the connection is closed.

I have checked the meaning of 502 in the Scrapy and Crawlera documentation, and it refers to the connection being refused or closed, the domain being unavailable, and similar problems. I have debugged the code around the point where the issue occurs, and everything is up to date.
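To narrow down which of those causes applies, it can help to look at the response status and headers instead of reading `.text` straight away. The following is only a rough sketch: the helper name `summarize_response` is made up for illustration, and it assumes the proxy reports its failure reason in an `X-Crawlera-Error` response header, which may not be present on every response.

```python
def summarize_response(status_code, headers):
    """Return a short diagnostic string for a proxied response.

    `headers` is any mapping of response headers, for example the
    `headers` attribute of a requests response object.
    """
    if 200 <= status_code < 300:
        return "ok"
    # Assumption: the proxy puts its failure reason in this header.
    reason = headers.get("X-Crawlera-Error", "not reported")
    return f"HTTP {status_code} (proxy error: {reason})"


# Hypothetical usage with the session from the question:
# resp = self.req_session.get(url, proxies=self.proxies,
#                             verify='../secret/crawlera-ca.crt')
# print(summarize_response(resp.status_code, resp.headers))
```

Keeping the response object around, rather than chaining `.text` onto the call, makes the status code and headers available for this kind of check.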

If anyone has ideas or knowledge about this, I would love to hear them, as I am stuck. What could the issue actually be?

NOTE: The Yelp URLs work normally when I open them in a browser.

fadingbeat

2 Answers


The website can tell from the headers of your request that you are a "scraper" and not a human user.

You should send different headers with the request, so that the scraped website thinks you are browsing with a regular browser.

For more info, refer to the Scrapy documentation.
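As a rough illustration of sending browser-like headers, here is a minimal sketch. The header values and the helper name `build_headers` are assumptions chosen for the example, not values known to work against any particular site.

```python
# Example browser-like headers; the exact values are illustrative only.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/86.0.4240.111 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_headers(extra=None):
    """Return a copy of the browser-like headers, plus any extra ones."""
    headers = dict(BROWSER_HEADERS)  # copy so the defaults stay untouched
    headers.update(extra or {})
    return headers


# Hypothetical usage with the session from the question:
# r = self.req_session.get(url, headers=build_headers(),
#                          proxies=self.proxies,
#                          verify='../secret/crawlera-ca.crt').text
```

The `extra` argument leaves room for proxy-specific headers without modifying the shared defaults.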

xqzy
Indeed, this was the issue. I was sending out regular browser headers, but for some reason they stopped being enough. Adding these solved the problem: `DEFAULT_REQUEST_HEADERS = {"X-Crawlera-Profile": "desktop", "X-Crawlera-Cookies": "disable"}` – fadingbeat Nov 09 '20 at 09:08

Some pages are not available in some countries, which is why using proxies is recommended. I tried requesting the URL and the connection was successful:

2020-11-05 02:50:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2020-11-05 02:50:40 [scrapy.core.engine] INFO: Spider opened
2020-11-05 02:50:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/biz/a-dog-in-motion-arcadia> (referer: None)
Justo