
I was building a web scraper to pull hrefs from https://www.startengine.com/explore, but I was struggling to get any. I decided to print the page and found out why.

Here is my code:

import pandas as pd
import os
import requests
from bs4 import BeautifulSoup
import re

URL = "https://www.startengine.com/explore"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "html.parser")

links = []
print(soup)

This is the output:

<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
</body>
</html>

Can someone help me work around the "403 Forbidden"?

  • Yes, it's probably bot prevention. You're writing a bot; they don't want you doing this. You should respect that. – Barmar Apr 20 '22 at 21:52
  • https://stackoverflow.com/questions/23073209/403-forbidden-output-while-using-beautifulsoup – Selman Apr 20 '22 at 21:54

1 Answer


You need to send a User-Agent header with your request, as follows:

import pandas as pd
import os
import requests
from bs4 import BeautifulSoup
import re

URL = "https://www.startengine.com/explore"
headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(URL, headers=headers)
print(page)  # should now show <Response [200]> instead of the 403
soup = BeautifulSoup(page.text, "html.parser")

links = []
print(soup)
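Once the 403 is out of the way, the hrefs the question was originally after can be pulled from the soup. Here is a minimal sketch as a standalone helper (the /offering/acme path is made up for illustration; note that the real explore page renders much of its listing with JavaScript, so requests may not see every link that appears in a browser):

```python
from bs4 import BeautifulSoup

def extract_hrefs(html):
    """Return every href value found in the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips <a> tags that have no href attribute
    return [a['href'] for a in soup.find_all('a', href=True)]

# Works on any HTML string, e.g. page.text from the request above.
sample = '<a href="/offering/acme">Acme</a> <a>no link</a>'
print(extract_hrefs(sample))  # ['/offering/acme']
```

Passing page.text to extract_hrefs gives the list of links found in the server-rendered HTML.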
Md. Fazlul Hoque