I'm trying to make a web scraper with Beautiful Soup that will print out the most popular post on Reddit, and I keep getting an error. Please explain in simple words if possible. Here's the code:

import requests
from bs4 import BeautifulSoup
url = 'https://www.reddit.com/'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
article = soup.find('div', attrs={"class": "y8HYJ-y_lTUHkQIc1mdCq _2INHSNB8V5eaWp4P0rY_mE"})
headline = article.a.h3.text
print(headline)

the error:

AttributeError: 'NoneType' object has no attribute 'a'

3 Answers

Please explain in simple words if possible.

AttributeError:

"There was an error that had to do with an attribute."

'NoneType' object

"It happened because you had something in your program that was the special None object,"

has no attribute 'a'

"and you tried to do .a with it, which isn't possible."

headline = article.a.h3.text
                  ^^

This is where you try to get .a from something, so that means that article is None.

article = soup.find('div', attrs={"class": "y8HYJ-y_lTUHkQIc1mdCq _2INHSNB8V5eaWp4P0rY_mE"})

This is how article gets its value, so that means that soup.find returned None.

Then you look at the documentation and see that this means BeautifulSoup could not find a <div> tag with that class attribute value in the HTML. So of course you can't find the nested <a> tag, because there's nothing for it to be nested in.
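
A defensive version of the lookup makes that failure explicit instead of crashing later on .a (this sketch just reuses the class name from your code, which, as explained next, probably no longer matches anything):

article = soup.find('div', attrs={"class": "y8HYJ-y_lTUHkQIc1mdCq _2INHSNB8V5eaWp4P0rY_mE"})
if article is None:
    # soup.find returned None: no <div> with that class in the fetched HTML.
    raise SystemExit("Post container not found; the page's class names may have changed.")
headline = article.a.h3.text
print(headline)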

Chances are the server generates the class name randomly, so you need to look at something else in the HTML to figure out which element you actually want; you can't just rely on what the class was the one time you viewed the page source.
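
For example, here is a sketch that selects by tag structure instead of the generated class name; it assumes post titles are still rendered as an <h3> inside a link, which you would verify in your browser's developer tools:

import requests
from bs4 import BeautifulSoup

url = 'https://www.reddit.com/'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Match the first <h3> nested under an <a>, instead of a generated class.
title = soup.select_one('a h3')
if title is None:
    print("No match; the structure changed or is rendered by JavaScript.")
else:
    print(title.text)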

Karl Knechtel

You can use the "old" version of Reddit to get the information (the new version renders the page with JavaScript, so some elements you see in the browser aren't in the HTML that BeautifulSoup parses):

import requests
from bs4 import BeautifulSoup


url = 'https://old.reddit.com/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

print(soup.select_one('.entry a.title').text)

Prints:

Megathread: President Donald Trump announces he has tested positive for Coronavirus

Or, append .json to the URL and read the data as JSON:

import json
import requests


url = 'https://reddit.com/.json'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
data = requests.get(url, headers=headers).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

print(data['data']['children'][0]['data']['title'])
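
The children list holds one entry per post, so (assuming the listing keeps this shape) the top few titles come out the same way:

for child in data['data']['children'][:5]:
    print(child['data']['title'])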

Note: Reddit also has an API, so you don't have to use BeautifulSoup at all.
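
For example, a minimal sketch using the third-party PRAW wrapper; the client ID and secret are placeholders you would get by registering an app in your Reddit account settings:

import praw

reddit = praw.Reddit(
    client_id='YOUR_CLIENT_ID',          # placeholder credential
    client_secret='YOUR_CLIENT_SECRET',  # placeholder credential
    user_agent='my-scraper/0.1',
)

# Front page sorted by "hot"; the first item is the current top post.
for submission in reddit.front.hot(limit=1):
    print(submission.title)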

Andrej Kesely

Adding a User-Agent header might help. Something like this:

headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6'}

# Pass headers as a keyword argument; requests.get(url, headers) would send
# the dict as URL query parameters instead of request headers.
response = requests.get(url, headers=headers)

You can find User-Agents here: https://webscraping.com/blog/User-agents/

Lndngr