
I'm trying to adapt the code in https://stackoverflow.com/a/46135607/9637147 to scrape all URL links for games on the Cyberix3D website. But when I run my code, it fails with a 403 Forbidden error. How do I fix my code?

This is so I can archive all of the games on the Cyberix3D website onto the Wayback Machine (http://web.archive.org/) more quickly. I've tried adding the line useragent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20170101 Firefox/67.0".encode("utf-8") before the first line of the for loop, then replacing html=urlopen(url) with html=urlopen(url,useragent) so the code uses that user agent, but I still get a 403 Forbidden error.
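For reference, passing the string as urlopen's second positional argument treats it as POST data, not as a header. A minimal sketch of attaching a user agent the way urllib expects, via a Request object (the User-Agent string is the one from the attempt above):

```python
from urllib.request import Request, urlopen

url = "http://www.gamemaker3d.com/games"
# Headers belong on the Request object; urlopen's second argument is the
# request body, which is why urlopen(url, useragent) doesn't set a header.
req = Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20170101 Firefox/67.0"
})
# urllib normalizes stored header names with str.capitalize().
print(req.get_header("User-agent"))
# html = urlopen(req)  # would send the request with the custom header
```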

from urllib.request import urlopen
from bs4 import BeautifulSoup
file="Cyberix3D games.csv"
f=open(file,"w")
Headers="Link\n"
f.write(Headers)
for page in range(1,410):
    url="http://www.gamemaker3d.com/games#page={}&orderBy=Recent".format(page)
    html=urlopen(url)
    soup=BeautifulSoup(html,"html.parser")
    Title=soup.find_all("a",{"href":"views-field-nothing"})
    for i in Title:
        try:
            link=i.find("a",{"href":"/player?pid="}).get_text()
            print(link)
            f.write("{}".format(link))
        except:AttributeError
f.close()

I expect the aforementioned links to be printed in the Python 3.7.4 Shell and also added to a CSV file called Cyberix3D games.csv. Instead, I get urllib.error.HTTPError: HTTP Error 403: Forbidden in the Shell, preceded by a series of File "C:\Users\Niall Ward\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line x, in y traceback lines, and an empty CSV file called Cyberix3D games.csv.

  • Hey Niall. 403 means that there are some authorization issues with the URL. Is the URL behind a login page? If so, your Python code might not be able to access it directly. – Yuvraj Jaiswal Sep 27 '19 at 05:28

1 Answer


Some websites block connections that don't come from browsers - anti-bot and anti-spam measures, etc. There are several solutions that could work: emulating a browser so the site sees a legitimate request; adding a header (such as a User-Agent) to your request; routing the request through a proxy; etc.
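As a quick sketch of the header approach with the requests library (the User-Agent string here is just an example; any common browser string should work):

```python
import requests

# Example browser-like User-Agent header.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Preparing the request (without sending it) shows the header is attached.
# Actually fetching the page would be: requests.get(url, headers=headers)
req = requests.Request(
    "GET", "http://www.gamemaker3d.com/games", headers=headers
).prepare()
print(req.headers["User-Agent"])
```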

After running your code I tried a simpler solution than those mentioned above: instead of from urllib.request import urlopen I used import requests, which required a few changes:

# Start by importing requests
import requests
from bs4 import BeautifulSoup
file="Cyberix3D games.csv"
f=open(file,"w")
Headers="Link\n"
f.write(Headers)
for page in range(1,410):
    url="http://www.gamemaker3d.com/games#page={}&orderBy=Recent".format(page)
    print(url)
    # Here we use requests to get the page and its content. 
    # Note that variables names don't really matter here.
    gamemaker_link=requests.get(url)
    # Use gamemaker_link.content, with lxml as the parser.
    gamemaker_content=BeautifulSoup(gamemaker_link.content, "lxml")

    # etc etc etc

Requirements

If you haven't already, you will need to install these (I used pip):

  1. requests
  2. lxml

Note

I am not sure if anything changes with handling the page elements, but this should at least help with accessing the page.
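For the extraction step itself, here is a hedged sketch: the "/player?pid=" pattern comes from the question's own code and the example links in the comments, but the surrounding markup is an assumption, so the HTML below is a stand-in for the real page content.

```python
from bs4 import BeautifulSoup

# Stand-in HTML; the real page structure may differ.
sample_html = """
<div class="views-field-nothing">
  <a href="/player?pid=055599149072">Downloae</a>
  <a href="/player?pid=055599049069">Gun Man</a>
  <a href="/about">About</a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# Keep only anchors whose href points at a game page.
links = ["http://gamemaker3d.com" + a["href"]
         for a in soup.find_all("a")
         if a.get("href", "").startswith("/player?pid=")]
print(links)  # the two /player?pid= links, with the site prefix added
```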

Hope it helps.

Happy coding!

  • That solution helps with accessing the page, but it only prints the links to the pages that contain the links to games on the Cyberix3D website, and creates a blank file named Cyberix3D games.csv. I want my program to add all links to games on the Cyberix3D website to a .csv file and/or other formats, such as the Python Shell or a .txt file. Here are two example links (Note: the users are from the Cyberix3D website, not StackOverflow): http://gamemaker3d.com/player?pid=055599149072 (Downloae by Mobile-and-Computer-tips) and http://gamemaker3d.com/player?pid=055599049069 (Gun Man by taseenhaseen). – Niall Ward Sep 29 '19 at 06:31
  • Hello again, @NiallWard! The solution I posted tries to attend to your question "[...] But it fails to do so when I run my code, giving me a 403 Forbidden error. How do I fix my code?". As I mentioned, I am not sure if you have to change anything later on your code in order to be able to extract the information you are looking for. I would assume that it has to be something like what you already had: Title=soup.find_all("a",{"href":"views-field-nothing"}). Anyhow, since this question refers to the connection error, I'd suggest you ask another question and direct it to your new issue. – Luís Flávio Sep 30 '19 at 17:39