I'm trying to make a web scraper with Beautiful Soup that will print out the most popular post on Reddit, and I keep getting an error. Please explain in simple words if possible. Here's the code:

import requests
from bs4 import BeautifulSoup
url = 'https://www.reddit.com/'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
article = soup.find('div', attrs={"class": "y8HYJ-y_lTUHkQIc1mdCq _2INHSNB8V5eaWp4P0rY_mE"})
headline = article.a.h3.text
print(headline)

the error:

AttributeError: 'NoneType' object has no attribute 'a'

3 Answers

Please explain in simple words if possible.

AttributeError:

"There was an error that had to do with an attribute."

'NoneType' object

"It happened because you had something in your program that was the special None object,"

has no attribute 'a'

"and you tried to do .a with it, which isn't possible."

headline = article.a.h3.text
                  ^^

This is where you try to get .a from something, so that means that article is None.

article = soup.find('div', attrs={"class": "y8HYJ-y_lTUHkQIc1mdCq _2INHSNB8V5eaWp4P0rY_mE"})

This is how article gets its value, so that means that soup.find returned None.

Then you look at the documentation and see that this means BeautifulSoup could not find a <div> tag with that class attribute value in the HTML. So of course you can't find the nested <a> tag, because there's nothing for it to be nested in.
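
A defensive version of the lookup makes that failure explicit instead of crashing later on .a (this sketch just reuses the class name from your code, which, as explained next, probably no longer matches anything):

article = soup.find('div', attrs={"class": "y8HYJ-y_lTUHkQIc1mdCq _2INHSNB8V5eaWp4P0rY_mE"})
if article is None:
    # soup.find returned None: no <div> with that class in the fetched HTML.
    raise SystemExit("Post container not found; the page's class names may have changed.")
headline = article.a.h3.text
print(headline)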

Chances are the server generates the class name randomly, so you need to look at something else in the HTML to figure out which element you actually want; you can't just rely on what the class was the one time you viewed the page source.
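
For example, here is a sketch that selects by tag structure instead of the generated class name; it assumes post titles are still rendered as an <h3> inside a link, which you would verify in your browser's developer tools:

import requests
from bs4 import BeautifulSoup

url = 'https://www.reddit.com/'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Match the first <h3> nested under an <a>, instead of a generated class.
title = soup.select_one('a h3')
if title is None:
    print("No match; the structure changed or is rendered by JavaScript.")
else:
    print(title.text)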

Karl Knechtel

You can use the "old" version of Reddit to get the information (the new version renders the page with JavaScript, so some elements you see in the browser aren't in the HTML that BeautifulSoup parses):

import requests
from bs4 import BeautifulSoup


url = 'https://old.reddit.com/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

print(soup.select_one('.entry a.title').text)

Prints:

Megathread: President Donald Trump announces he has tested positive for Coronavirus

Or, append .json to the URL and read the data as JSON:

import json
import requests


url = 'https://reddit.com/.json'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
data = requests.get(url, headers=headers).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

print(data['data']['children'][0]['data']['title'])
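
The children list holds one entry per post, so (assuming the listing keeps this shape) the top few titles come out the same way:

for child in data['data']['children'][:5]:
    print(child['data']['title'])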

Note: Reddit also has an API, so you don't have to use BeautifulSoup at all.
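
For example, a minimal sketch using the third-party PRAW wrapper; the client ID and secret are placeholders you would get by registering an app in your Reddit account settings:

import praw

reddit = praw.Reddit(
    client_id='YOUR_CLIENT_ID',          # placeholder credential
    client_secret='YOUR_CLIENT_SECRET',  # placeholder credential
    user_agent='my-scraper/0.1',
)

# Front page sorted by "hot"; the first item is the current top post.
for submission in reddit.front.hot(limit=1):
    print(submission.title)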

Andrej Kesely

Adding a User-Agent header might help. Something like this:

headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6'}

# Pass headers as a keyword argument; requests.get(url, headers) would send
# the dict as URL query parameters instead of request headers.
response = requests.get(url, headers=headers)

You can find User-Agents here: https://webscraping.com/blog/User-agents/

Lndngr