1

I need to parse HTML from a site. When I run the code on localhost, the scrape works normally; only in the deployed version do I get a 403 Forbidden. I have already tried setting the User-Agent and Referer headers, as shown below:

Note: Both the site and I are in Brazil, and my code is deployed on Heroku.

Code:

import requests

# Browser-like headers sent with the request
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36",
    "Referer": "https://www.guichevirtual.com.br",
}
url = 'https://www.guichevirtual.com.br/passagem-de-onibus/campo-grande-ms/sao-paulo-todas-sp'
r = requests.get(url, headers=header)
print(r.text)

Output:

<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
</body>
</html>

Sorry if there are any errors in my English. I'm still learning.

2 Answers

0

Here are a few possibilities that may be in play here:

  1. The site may need more headers before it gives you access to the page. I'd suggest taking a look at the requests documentation on custom headers if you're unsure how to add them, but since you already have a User-Agent header configured, adding more should be fairly easy.

  2. The resource you are trying to access may require you to be logged in, or otherwise authenticated with the page/API. 403 Forbidden suggests that you do not have the proper rights to the content. You may need to pass an X-Username and X-Password/X-Pin header in your request, or you may simply be unable to scrape the page.

  3. Your user agent may not be one the site/API accepts, in which case you would need to change it to something else. Per this question/answer, it's fairly simple to block a specific user agent. Have you looked at the question that was linked in a comment yet? If not, it might help you.
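As a rough illustration of the first and third points, here is a minimal sketch that puts a fuller set of browser-like headers on a `requests.Session` and previews what would be sent. The `Accept` and `Accept-Language` values are assumptions modelled on a typical desktop Chrome request, not values the site is known to require, and `build_session` and `preview_headers` are hypothetical helper names.

```python
import requests

# Assumed values: modelled on a typical desktop Chrome request. They are not
# taken from the site's documentation; compare with your own browser's
# network tab and adjust as needed.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "pt-BR,pt;q=0.9,en;q=0.8",
    "Referer": "https://www.guichevirtual.com.br",
}


def build_session(extra_headers=None):
    """Return a Session whose default headers look like a browser's."""
    session = requests.Session()
    session.headers.update(BROWSER_HEADERS)
    if extra_headers:
        session.headers.update(extra_headers)
    return session


def preview_headers(session, url):
    """Return the headers that would be sent, without any network call."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return dict(prepared.headers)
```

To use it, call `build_session()` once and then `session.get(url, timeout=10)` in place of `requests.get`. Printing `preview_headers(session, url)` in both environments is a quick way to confirm that the deployed code really sends the same headers as the local run.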

Grace
  • 1. I'll take a look at that. 2. I don't think so, because when I run it on localhost the user agent is `'User-Agent': 'python-requests/2.25.1'` and it works fine. 3. The user agent I'm using is the same one that works on localhost. – Mateus Garcia Nascimento May 24 '21 at 02:31
0

It worked fine for me. The only typo in your code is:

r = request.get(url, headers=header)

The module name is missing an "s".

The correct line would be:

r = requests.get(url, headers=header)
SudE
  • If I run it on localhost, it works fine too! But in the Heroku deployment it doesn't work. Note: that was a typo in the post; the code itself is correct. Thanks! – Mateus Garcia Nascimento May 24 '21 at 00:35
  • Heroku is probably hosted outside Brazil. The site may be using a WAF that blocks traffic from abroad. You could try an AWS free-tier instance hosted in the US, just to check whether it works there, or a VPN connected to some other country. – SudE May 24 '21 at 01:52
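One way to test the geo-blocking theory from the last comment is to run the same request from each environment, optionally through a proxy, and compare a short summary of the responses. This is only a sketch: `fetch` and `summarize_response` are hypothetical helper names, and any proxy URL you pass in is an assumption you would have to supply yourself.

```python
import requests


def fetch(url, headers, proxy_url=None):
    """GET the URL, optionally routing the request through a proxy.

    proxy_url is a placeholder you must supply yourself, for example a
    proxy located in Brazil. No real proxy address is assumed here.
    """
    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)


def summarize_response(response):
    """Collect the details worth comparing between localhost and Heroku."""
    return {
        "status": response.status_code,
        "server": response.headers.get("Server"),
        "sent_user_agent": response.request.headers.get("User-Agent"),
        "body_start": response.text[:200],
    }
```

If `summarize_response(fetch(url, header))` shows status 200 locally and 403 on Heroku with identical request headers, the difference is most likely the source IP address or region rather than anything in the code.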