
I am trying to write a Python 3 script that iterates through a list of mods hosted on a shared website and downloads the latest version of each. I have gotten stuck on step one: fetching the mod version list from the site. I am trying to use urllib, but it is throwing a 403: Forbidden error.

I thought this might be some sort of anti-scraping rejection from the server, and I read that you can sometimes get around it by setting the request headers to match a browser. I ran Wireshark while using my browser and identified the headers it was sending out:

Host: ocsp.pki.goog\r\n
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0\r\n
Accept: */*\r\n
Accept-Language: en-US,en;q=0.5\r\n
Accept-Encoding: gzip, deflate\r\n
Content-Type: application/ocsp-request\r\n
Content-Length: 83\r\n
Connection: keep-alive\r\n
\r\n

I believe I defined the headers correctly, but I had to back two entries out because they caused a 400 error:

from urllib.request import Request, urlopen

count = 0
mods = ['mod1', 'mod2', ...] #this has been created to complete the URL and has been tested to work

#iterate through all mods and download latest version
while mods:
    url = 'https://Domain/'+mods[count]
    #set the headers to match the browser I was using at the time of writing the script
    req = Request(url)
    #req.add_header('Host', 'ocsp.pki.goog\\r\\n') #this reports 400 bad request
    req.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) Gecko/20100101 Firefox/85.0\\r\\n')
    req.add_header('Accept', '*/*\\r\\n')
    req.add_header('Accept-Language', 'en-US,en;q=0.5\\r\\n')
    req.add_header('Accept-Encoding', 'gzip, deflate\\r\\n')
    req.add_header('Content-Type', 'application/ocsp-request\\r\\n')
    #req.add_header('Content-Length', '83\\r\\n') #this reports 400 bad request
    req.add_header('Connection', 'keep-alive\\r\\n')
    html = urlopen(req).read().decode('utf-8')

This still throws a 403: Forbidden error:

Traceback (most recent call last):
  File "SCRIPT.py", line 19, in <module>
    html = urlopen(req).read().decode('utf-8')
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

I'm not sure what I'm doing wrong. I assume the problem is in how I've defined my header values, but I can't see what it is. Any help would be appreciated.
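Edit: for reference, here is a stripped-down, offline version of the request construction I'm aiming for. The header values are plain strings with no `\r\n` appended (urllib terminates each header line itself), and it iterates the list directly rather than using the `while mods:` loop above, which as written never advances `count`. The domain and mod names are placeholders:

```python
from urllib.request import Request

mods = ['mod1', 'mod2']  # placeholder mod names
browser_headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) '
                  'Gecko/20100101 Firefox/85.0',
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
}

for mod in mods:
    url = 'https://example.com/' + mod  # placeholder domain
    req = Request(url, headers=browser_headers)
    # urlopen(req) would go here; for now, print the headers urllib
    # would actually send, without opening a connection
    for name, value in req.header_items():
        print(f'{name}: {value}')
```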

  • Have you tried the `requests` library or UNIX's `curl`? – astrochun Feb 17 '21 at 02:29
  • Hi, I tried via `requests` and this also returned a 403 response: `import requests; mod = mods.pop(); print(mod); url = file_url+mod+file_filter; print(url); req = requests.get(url); print(req)` Output: the desired mod, the desired URL, and `<Response [403]>` – TLShandshake Feb 17 '21 at 12:39
  • Strangely enough, using `curl` in the terminal does work. – TLShandshake Feb 17 '21 at 12:46
  • So what was the full `curl` command that worked? – astrochun Feb 17 '21 at 13:31
  • You might try this. It allows you to bootstrap a `curl` command into a `requests` call: https://curl.trillworks.com/ – astrochun Feb 17 '21 at 13:32
  • Ok, so I've done some more digging into this, thank you for your help so far. I didn't actually read the HTML I was getting back from `curl` when I first responded. It turns out even `curl` hits the same problem: the website is presenting a captcha to my requests no matter what tool I use. I'm confused as to why adjusting the headers isn't fixing this, since browsing to the website in Firefox does not trigger a captcha. I will dig a bit deeper into resolving the captcha (or whatever the browser is doing to avoid it). – TLShandshake Feb 18 '21 at 14:51
  • I don't know how, but I recall some code that used an API key for a captcha service, if that really is the problem. Good luck! – astrochun Feb 18 '21 at 14:57
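Update: since the browser gets through without a captcha, one difference worth ruling out is cookie handling — Firefox resends whatever cookies the site has set, while the `urllib`, `requests`, and `curl` attempts above start cold on every request. A sketch of a stdlib opener that stores and replays cookies like a browser would (the domain is a placeholder, and whether this satisfies the site's bot check is untested):

```python
import http.cookiejar
from urllib.request import build_opener, HTTPCookieProcessor

# A cookie jar shared across requests, so Set-Cookie responses from one
# request are sent back on the next, like a browser session
jar = http.cookiejar.CookieJar()
opener = build_opener(HTTPCookieProcessor(jar))
opener.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:85.0) '
                   'Gecko/20100101 Firefox/85.0'),
    ('Accept-Language', 'en-US,en;q=0.5'),
]

# Once the 403 is resolved, this would fetch each mod page, e.g.:
# html = opener.open('https://example.com/mod1').read().decode('utf-8')
print(len(jar), 'cookies stored so far')
```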

0 Answers