How do I specify what div tag I want to grab?

Question

I'm trying to grab the headline and article summary from this website. So far I know how to get headlines that sit inside article > h2 > a tags, but I'm not sure how to get the headline when there are multiple div tags within the article tag. I've left the article's link below so you can hopefully see what I mean. Usually I'd write headline = article.h2.a.text, but this article tag has 2 div tags, and it's very frustrating not to know how to tackle this at all. My thought process was to start by specifying the article tag, then the div tag I wanted to access, followed by the h1 tag that holds the headline text, but that didn't work. I'd imagine this is the right way of viewing the problem and I'm just not going about it properly. I know I'm definitely missing something, but I just don't know what. Any help or resources would be extremely helpful.

ARTICLE: https://www.huffpost.com/entry/angry-squirrel-attacks-queens_n_5fee30b1c5b6ec8ae0b242d2

Here's my code:

from bs4 import BeautifulSoup
import requests
import csv

source = requests.get('https://www.huffpost.com/entry/angry-squirrel-attacks-queens_n_5fee30b1c5b6ec8ae0b242d2').text

soup = BeautifulSoup(source, 'lxml')


article = soup.find('article') 
headline = article.find('div', class_='headline js-headline').h1.text 

print(headline)

Error:

Traceback (most recent call last):
  File "C:\Users\Denze\MyPythonScripts\Webscraping learning\Webscrape article.py", line 12, in <module>
    headline = article.find('div', class_='headline__title cc_cursor').h1.text
AttributeError: 'NoneType' object has no attribute 'find'

I think the solution you are after is in the link below: https://stackoverflow.com/questions/57462036/how-can-i-bypass-a-cookie-agreement-page-while-web-scraping-using-python Your current code may be reaching a cookie consent page and therefore getting different HTML than you were expecting. Look at the `soup` output to confirm whether that is the case, and use the above link to resolve it. – sudhish, Jan 01 '21 at 13:22

dimay · Accepted Answer · 2021-01-01T19:38:24.523

0

Check the status of the response:

source = requests.get('https://www.huffpost.com/entry/angry-squirrel-attacks-queens_n_5fee30b1c5b6ec8ae0b242d2')

print(source)

Output:

<Response [403]>

The request is being rejected (403 Forbidden), so there is no article tag to find. Try setting a User-Agent header:

from bs4 import BeautifulSoup
import requests 
import csv

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    }

source = requests.get('https://www.huffpost.com/entry/angry-squirrel-attacks-queens_n_5fee30b1c5b6ec8ae0b242d2', headers=headers)
if source.status_code == 200:
    soup = BeautifulSoup(source.text, 'lxml')


    article = soup.find('article') 
    headline = article.find('div', class_='headline js-headline').h1.text 

    print(headline)
else:
    print(f"The requests status is: {source.status_code}")
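As for the original question of picking one div out of several inside an article tag, here is a minimal sketch. The HTML snippet, the class names other than headline js-headline, and the helper names get_headline and get_headline_css are made up for illustration; the real page's markup may differ. It uses the built-in html.parser so it runs without lxml, and it checks for None at each step so a missing tag does not raise AttributeError.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page's structure may differ.
SAMPLE_HTML = """
<article>
  <div class="top-media">
    <h1>Not the headline</h1>
  </div>
  <div class="headline js-headline">
    <h1 class="headline__title">Example headline</h1>
    <div class="headline__subtitle">Example summary.</div>
  </div>
</article>
"""


def get_headline(html):
    """Return the headline text, or None if an expected tag is missing."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("article")
    if article is None:  # e.g. an error or consent page came back
        return None
    # class_="headline" matches a div whose class list contains "headline".
    box = article.find("div", class_="headline")
    if box is None or box.h1 is None:
        return None
    return box.h1.get_text(strip=True)


def get_headline_css(html):
    """Same lookup written as a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one("article div.headline h1")
    return tag.get_text(strip=True) if tag else None


print(get_headline(SAMPLE_HTML))
print(get_headline_css(SAMPLE_HTML))
```

With a live page you would pass source.text instead of SAMPLE_HTML. A None result then means the tags were not where you expected them, which is worth checking before chaining .h1.text.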

edited Jan 01 '21 at 19:38

answered Jan 01 '21 at 09:25

dimay


made some change – dimay Jan 01 '21 at 19:38
This works! Turns out it was just an issue with the user agent. (headers =...) All of the other code isn't necessary. I'll save this for future reference though. – Johnny Silverhand Jan 01 '21 at 19:54

score 0 · Answer 2 · answered Jan 01 '21 at 13:25

I think the solution you are after is in the link below:

How can I bypass a cookie agreement page while web scraping using Python?

Your current code may be reaching a cookie consent page and therefore getting different HTML than you were expecting.
Look at the soup output to confirm whether that is the case, and use the above link to resolve it.
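One way to do that check, as a rough sketch: summarize what actually came back before trying to parse it. The helper name describe_page and the keys it returns are hypothetical, chosen only for this example.

```python
from bs4 import BeautifulSoup


def describe_page(status_code, html):
    """Summarize a response so an error or consent page is easy to spot."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else None
    return {
        "status": status_code,                            # 200 means the request succeeded
        "title": title,                                   # page <title>, if there is one
        "has_article": soup.find("article") is not None,  # is the expected tag present?
    }


# Stand-in strings for illustration; with a live request you would call
# describe_page(source.status_code, source.text).
print(describe_page(403, "<html><head><title>Access Denied</title></head></html>"))
print(describe_page(200, "<html><body><article><h1>Hi</h1></article></body></html>"))
```

If has_article is False or the title is not the article's title, the HTML you received is not the page you see in the browser.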
