How to resolve 502 response code in Scrapy request?

Question

I built a Scrapy spider that scrapes data from Yelp. All requests go through the Crawlera proxy. The spider gets a URL to scrape, sends a request, and scrapes the data. This worked fine until the other day, when I started getting a 502 None response. It appears after this line executes:

r = self.req_session.get(url, proxies=self.proxies, verify='../secret/crawlera-ca.crt').text

The log output:

2020-11-04 14:27:55 [urllib3.connectionpool] DEBUG: https://www.yelp.com:443 "GET /biz/a-dog-in-motion-arcadia HTTP/1.1" 502 None

So it seems the spider cannot reach the URL because the connection is closed.

I have checked the meaning of 502 in the Scrapy and Crawlera documentation; it refers to the connection being refused or closed, the domain being unavailable, and similar conditions. I have debugged the code where the issue occurs, and everything is up to date.
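One way to narrow down a 502 like this is to inspect the response before taking `.text`. Below is a minimal sketch, assuming the proxy reports its error code in an `X-Crawlera-Error` response header (worth verifying against the Crawlera documentation). `describe_proxy_failure` is a hypothetical helper name, not part of Scrapy or requests.

```python
def describe_proxy_failure(status_code, headers):
    """Return a short diagnostic string for a failed proxied response.

    Returns None when the status code does not indicate an error.
    """
    if status_code < 400:
        return None
    # HTTP header names are case-insensitive, so normalise the keys first.
    lowered = {name.lower(): value for name, value in headers.items()}
    # Assumption: the proxy names the failure in an X-Crawlera-Error header.
    reason = lowered.get("x-crawlera-error", "unknown")
    return f"HTTP {status_code} via proxy, X-Crawlera-Error={reason}"
```

With the request from the question, this could be called as `describe_proxy_failure(resp.status_code, resp.headers)`, where `resp` is the object returned by `self.req_session.get(...)` before `.text` is read.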

If anyone has ideas or knowledge about this, I would love to hear them, as I am stuck. What could the issue actually be?

NOTE: Yelp URLs work normally when I open them in a browser.

Have you considered disabling cookies in both Scrapy and Crawlera? — Gallaecio, Nov 06 '20 at 17:53
That worked. I added some additional headers for Crawlera that I did not have before. Thank you. — fadingbeat, Nov 09 '20 at 09:06

xqzy · Answer 1 · 2020-11-08T01:25:51.043

1

The website can tell from your request headers that you are a "scraper" and not a human user.

You should send different headers with the request, so that the scraped website thinks you are browsing with a regular browser.

For more info, refer to the Scrapy documentation.

edited Nov 08 '20 at 01:25

answered Nov 07 '20 at 20:33

Indeed, this was the issue. I was sending out regular browser headers, but for some reason they stopped being enough. Adding these solved the problem: `DEFAULT_REQUEST_HEADERS = {"X-Crawlera-Profile": "desktop", "X-Crawlera-Cookies": "disable"}` – fadingbeat Nov 09 '20 at 09:08
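For readers who want to try the same fix, the headers from the comment above might sit in a project's `settings.py` roughly as follows. This is a sketch, not the asker's actual file: the two header values are copied from the comment, while `COOKIES_ENABLED = False` is an assumption added to follow the earlier suggestion to disable cookies in Scrapy as well.

```python
# settings.py (sketch)

# Crawlera-specific headers, with the values quoted in the comment above.
DEFAULT_REQUEST_HEADERS = {
    "X-Crawlera-Profile": "desktop",
    "X-Crawlera-Cookies": "disable",
}

# Assumption: also switch off Scrapy's own cookie handling, following the
# suggestion to disable cookies on both the Scrapy and the Crawlera side.
COOKIES_ENABLED = False
```

Note that overriding `DEFAULT_REQUEST_HEADERS` replaces Scrapy's built-in `Accept` and `Accept-Language` defaults, so add those back to the dict if your spider relies on them.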

score 0 · Answer 2 · answered Nov 05 '20 at 02:55

Some pages are not available in certain countries, which is why using proxies is recommended. I tried requesting the URL and the connection was successful:

2020-11-05 02:50:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2020-11-05 02:50:40 [scrapy.core.engine] INFO: Spider opened
2020-11-05 02:50:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/biz/a-dog-in-motion-arcadia> (referer: None)

That's great, though I am using a proxy and still getting 502. — fadingbeat, Nov 05 '20 at 10:57
