
I'm trying to adapt the code in https://stackoverflow.com/a/46135607/9637147 to scrape all URL links for games on the Cyberix3D website. But when I run my code, it fails with a 403 Forbidden error. How do I fix my code?

This is so I can archive all of the games on the Cyberix3D website onto the Wayback Machine (http://web.archive.org/) more quickly. I've tried adding the line useragent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20170101 Firefox/67.0".encode("utf-8") before the first line of the for loop, then replacing html=urlopen(url) with html=urlopen(url,useragent) so the code uses that user agent, but I still get a 403 Forbidden error.
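For reference, passing the string as urlopen's second positional argument treats it as POST data, not as a header. A minimal sketch of attaching a user agent the way urllib expects, via a Request object (the User-Agent string is the one from the attempt above):

```python
from urllib.request import Request, urlopen

url = "http://www.gamemaker3d.com/games"
# Headers belong on the Request object; urlopen's second argument is the
# request body, which is why urlopen(url, useragent) doesn't set a header.
req = Request(url, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20170101 Firefox/67.0"
})
# urllib normalizes stored header names with str.capitalize().
print(req.get_header("User-agent"))
# html = urlopen(req)  # would send the request with the custom header
```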

from urllib.request import urlopen
from bs4 import BeautifulSoup
file="Cyberix3D games.csv"
f=open(file,"w")
Headers="Link\n"
f.write(Headers)
for page in range(1,410):
    url="http://www.gamemaker3d.com/games#page={}&orderBy=Recent".format(page)
    html=urlopen(url)
    soup=BeautifulSoup(html,"html.parser")
    Title=soup.find_all("a",{"href":"views-field-nothing"})
    for i in Title:
        try:
            link=i.find("a",{"href":"/player?pid="}).get_text()
            print(link)
            f.write("{}".format(link))
        except:AttributeError
f.close()

I expect the aforementioned links to be printed in the Python 3.7.4 Shell and also added to a CSV file called Cyberix3D games.csv. Instead, I get urllib.error.HTTPError: HTTP Error 403: Forbidden in the Shell, preceded by a series of File "C:\Users\Niall Ward\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line x, in y traceback lines, and an empty CSV file called Cyberix3D games.csv.

  • Hey Niall. 403 means that there are some authorization issues with the URL. Is the URL behind a login page? If so, your Python code might not be able to access it directly. – Yuvraj Jaiswal Sep 27 '19 at 05:28

1 Answer


Some websites block connections that don't come from browsers - anti-bot and anti-spam measures, etc. There are several solutions that could work: emulating a browser so the site sees a legitimate request; adding a header (such as a User-Agent) to your request; routing the request through a proxy; etc.
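As a quick sketch of the header approach with the requests library (the User-Agent string here is just an example; any common browser string should work):

```python
import requests

# Example browser-like User-Agent header.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Preparing the request (without sending it) shows the header is attached.
# Actually fetching the page would be: requests.get(url, headers=headers)
req = requests.Request(
    "GET", "http://www.gamemaker3d.com/games", headers=headers
).prepare()
print(req.headers["User-Agent"])
```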

After running your code I tried a simpler solution than those mentioned above: instead of from urllib.request import urlopen I used import requests, which required a few changes:

# Start by importing requests
import requests
from bs4 import BeautifulSoup
file="Cyberix3D games.csv"
f=open(file,"w")
Headers="Link\n"
f.write(Headers)
for page in range(1,410):
    url="http://www.gamemaker3d.com/games#page={}&orderBy=Recent".format(page)
    print(url)
    # Here we use requests to get the page and its content. 
    # Note that variables names don't really matter here.
    gamemaker_link=requests.get(url)
    # Use gamemaker_link.content, with lxml as the parser.
    gamemaker_content=BeautifulSoup(gamemaker_link.content, "lxml")

    # etc etc etc

Requirements

If you haven't already, you will need to install these (I used pip):

  1. requests
  2. lxml

Note

I am not sure if anything changes with handling the page elements, but this should at least help with accessing the page.
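For the extraction step itself, here is a hedged sketch: the "/player?pid=" pattern comes from the question's own code and the example links in the comments, but the surrounding markup is an assumption, so the HTML below is a stand-in for the real page content.

```python
from bs4 import BeautifulSoup

# Stand-in HTML; the real page structure may differ.
sample_html = """
<div class="views-field-nothing">
  <a href="/player?pid=055599149072">Downloae</a>
  <a href="/player?pid=055599049069">Gun Man</a>
  <a href="/about">About</a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# Keep only anchors whose href points at a game page.
links = ["http://gamemaker3d.com" + a["href"]
         for a in soup.find_all("a")
         if a.get("href", "").startswith("/player?pid=")]
print(links)  # the two /player?pid= links, with the site prefix added
```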

Hope it helps.

Happy coding!

  • That solution helps with accessing the page, but it only prints the links to the pages that contain the links to games on the Cyberix3D website, and creates a blank file named Cyberix3D games.csv. I want my program to add all links to games on the Cyberix3D website to a .csv file and/or other formats, such as the Python Shell or a .txt file. Here are two example links (Note: the users are from the Cyberix3D website, not StackOverflow): http://gamemaker3d.com/player?pid=055599149072 (Downloae by Mobile-and-Computer-tips) and http://gamemaker3d.com/player?pid=055599049069 (Gun Man by taseenhaseen). – Niall Ward Sep 29 '19 at 06:31
  • Hello again, @NiallWard! The solution I posted tries to attend to your question "[...] But it fails to do so when I run my code, giving me a 403 Forbidden error. How do I fix my code?". As I mentioned, I am not sure if you have to change anything later on your code in order to be able to extract the information you are looking for. I would assume that it has to be something like what you already had: Title=soup.find_all("a",{"href":"views-field-nothing"}). Anyhow, since this question refers to the connection error, I'd suggest you ask another question and direct it to your new issue. – Luís Flávio Sep 30 '19 at 17:39