I'm very new to coding and I'm just learning about web scraping.

I am not sure what I am doing wrong here. It is probably basic stuff with a very simple solution for some of you. Any help is greatly appreciated.

from urllib.request import urlopen as uReq

dcgp_url = "http://news.formulad.com/"

uClient = uReq(dcgp_url)
page_html = uClient.read()
uClient.close()

It then gives me this error:

C:\Users\mateu\AppData\Local\Programs\Python\Python38-32\python.exe "E:/Discord Bot/Web scraping.py"
Traceback (most recent call last):
  File "E:/Discord Bot/Web scraping.py", line 7, in <module>
    uClient = uReq(dcgp_url)
  File "C:\Users\mateu\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\mateu\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\mateu\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "C:\Users\mateu\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\mateu\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\Users\mateu\AppData\Local\Programs\Python\Python38-32\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Process finished with exit code 1
wovano
Slappy
  • HTTP Error 403 means you are not allowed to access the URL you requested. It is probably because you are using a scraper. Setting a User-Agent should fix it (please check the website's legal docs to verify that this is allowed). – narendra-choudhary Apr 16 '20 at 22:59
  • Absolutely no idea why you're getting a 403 with urllib but with the requests module, it seems to work. https://www.w3schools.com/python/module_requests.asp – Sri Apr 16 '20 at 22:59

1 Answer

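As explained here and here, the website you want to visit appears to reject requests that do not carry a browser-like User-Agent (by default, urllib identifies itself as Python-urllib followed by the Python version). You can find your browser's User-Agent by searching 'my user agent' in Google.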
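
If you are curious what urllib announces itself as when you do not set anything, you can inspect the headers a fresh opener would send. This is only a small sketch: the helper name default_user_agent is made up for illustration, and the exact version string depends on your Python installation.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent that urllib sends when none is set explicitly."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs added to every request.
    return dict(opener.addheaders)["User-agent"]


print(default_user_agent())  # something like Python-urllib/3.8
```

Some sites refuse that default value, which would explain the 403 here.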

The following code should work:

from urllib.request import urlopen, Request

dcgp_url = "http://news.formulad.com/"
# Send a browser-like User-Agent instead of urllib's default one.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
req = Request(url=dcgp_url, headers=headers)
# The with block closes the connection automatically, even on errors.
with urlopen(req) as uClient:
    page_html = uClient.read()
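
If you would rather have the script report a refusal than crash with a traceback, the request can be wrapped as sketched below. Treat it as a starting point, not a tested fix: build_request, fetch_html and the shortened USER_AGENT value are placeholders I made up, I have not checked whether the site accepts a shortened string, and decoding as UTF-8 is an assumption about the page.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Placeholder value (assumption); substitute your own browser's User-Agent.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url, user_agent=USER_AGENT):
    """Create a GET request that carries an explicit User-Agent header."""
    return Request(url=url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Return the page text, or None if the server refuses or is unreachable."""
    try:
        with urlopen(build_request(url), timeout=10) as response:
            # Assumes UTF-8; undecodable bytes are replaced, not raised.
            return response.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        # HTTPError must be caught first because it subclasses URLError.
        print(f"Server refused the request: {err.code} {err.reason}")
    except URLError as err:
        print(f"Could not reach the server: {err.reason}")
    return None
```

You would then call page_html = fetch_html("http://news.formulad.com/") and check for None before parsing.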
Skryge