Beautiful soup returning empty in PythonAnywhere

Question

I have a bs4 app that, in this context, prints the most recent post on igg-games.com.
Code:

from bs4 import BeautifulSoup
import requests

def get_new():
    new = {}
    for i in BeautifulSoup(requests.get('https://igg-games.com/').text, features="html.parser").find_all('article'):
        elem = i.find('a', class_='uk-link-reset')
        new[elem.get_text()] = (elem.get('href'), ", ".join([x.get_text() for x in i.find_all('a', rel = 'category tag')]), i.find('time').get_text())
    return new
current = get_new()
new_item = list(current.items())[0]
print(f"Title: {new_item[0]}\nLink: {new_item[1][0]}\nCatagories: {new_item[1][1]}\nAdded: {new_item[1][2]}")

Output on my machine:

Title: Beholder's Lair Free Download
Link: https://igg-games.com/beholders-lair-free-download.html
Catagories: Action, Adventure
Added: January 7, 2021

I know it works. However, my end goal is to turn this into RSS feed entries, so I plugged it all into a premium PythonAnywhere container. There, my function get_new() returns {}. Is there something I need to do that I'm missing?

The response code from `requests.get()` is possibly not 200, which would mean the website bans requests from specific IP addresses (in this case PythonAnywhere's). You may need to use some kind of (rotating) proxy and try specifying a *user agent* in the request headers. — Dmytro O, Jan 07 '21 at 17:29
That's a good point, thank you. Any suggestions as to what I should use? I've never had to specify a proxy in this context or a user agent. — Isaiah, Jan 07 '21 at 17:32
For user agents, take a look here: https://stackoverflow.com/questions/27652543/how-to-use-python-requests-to-fake-a-browser-visit-a-k-a-and-generate-user-agent. This tutorial on rotating proxies can also be helpful: https://codelike.pro/create-a-crawler-with-rotating-ip-proxy-in-python/ — Dmytro O, Jan 07 '21 at 18:07
That solved it! Thank you very much! I'll add an answer to this question now. — Isaiah, Jan 07 '21 at 18:35
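
The status-code check suggested in the comments can be sketched roughly as follows. This is only an illustration: `diagnose` and `check_site` are hypothetical helper names, not part of the original code, and the 10-second timeout is an arbitrary assumption. The idea is to look at the status code and at whether any `<article>` markup came back, so a blocked request is visible instead of silently producing `{}`.

```python
import requests

URL = "https://igg-games.com/"  # the site from the question


def diagnose(status_code, html):
    """Summarise a response so a block shows up as more than an empty dict."""
    return {
        "status": status_code,
        "status_ok": status_code == 200,
        # Number of opening <article> tags in the raw HTML (a rough count).
        "article_tags": html.count("<article"),
    }


def check_site(url=URL):
    """Fetch url with requests' default headers and summarise the result."""
    response = requests.get(url, timeout=10)  # timeout value is an assumption
    return diagnose(response.status_code, response.text)
```

Running `print(check_site())` both locally and on PythonAnywhere should show whether the two environments receive different responses.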

score 2 · Answer 1 · answered Jan 07 '21 at 18:42

Solved thanks to the help of Dmytro O.

PythonAnywhere was likely being blocked as a client, and setting a browser-like User-Agent header allowed me to receive a response from my intended site.

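# The fix: send a browser-like User-Agent header with the request
import requests

url = 'https://igg-games.com/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)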

When placed in my code:

def get_new():
    new = {}
    for i in BeautifulSoup(requests.get('https://igg-games.com/', headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}).text, features="html.parser").find_all('article'):
        elem = i.find('a', class_='uk-link-reset')
        new[elem.get_text()] = (elem.get('href'), ", ".join([x.get_text() for x in i.find_all('a', rel = 'category tag')]), i.find('time').get_text())
    return new

This method was provided to me through this Stack Overflow post: How to use Python requests to fake a browser visit a.k.a and generate User Agent?
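
A slightly more defensive variant of the same function might look like the sketch below. It is not the code from the answer: the `parse_posts` helper, the `HEADERS` constant, the timeout, and the `raise_for_status()` call are additions made here for illustration. Separating parsing from fetching means a blocked request raises an error instead of returning `{}`, and articles that lack the expected markup are skipped rather than raising `AttributeError`.

```python
from bs4 import BeautifulSoup
import requests

# Same User-Agent string as in the answer above.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
    )
}


def parse_posts(html):
    """Return {title: (link, categories, date)} for each <article> in html."""
    posts = {}
    soup = BeautifulSoup(html, features="html.parser")
    for article in soup.find_all("article"):
        link = article.find("a", class_="uk-link-reset")
        time_tag = article.find("time")
        if link is None or time_tag is None:
            continue  # skip articles that lack the expected markup
        categories = ", ".join(
            a.get_text() for a in article.find_all("a", rel="category tag")
        )
        posts[link.get_text()] = (link.get("href"), categories, time_tag.get_text())
    return posts


def get_new(url="https://igg-games.com/"):
    """Fetch the page with a browser-like User-Agent and parse its posts."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of returning {}
    return parse_posts(response.text)
```

Because `parse_posts` takes a plain HTML string, it can be tried against a saved copy of the page without making any network request.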
