I have written a very small Python script that scrapes article headlines from CNN's website.

import requests
from bs4 import BeautifulSoup

url='https://edition.cnn.com/'
topics=['world','politics','business']
r=requests.get(url+topics[1])
soup=BeautifulSoup(r.content,'html.parser')
spans=soup.find_all('span',{'class':"cd__headline-text"})
print(spans)

When I run this code, I simply get an empty list as output. This is not what I expected, as I am trying to scrape the text inside the tag. The HTML snippet I am targeting is:

<span class="cd__headline-text">
Bernie Sanders faces pivotal clash as Democratic establishment joins forces against him
</span>

Please help clarify what my code seems to be doing wrong and/or any logical errors that I might be making.

Sid
  • 2
    A few things to always check when using `requests` to get the contents of a site. Have you checked the response from the site? Does `r` look like you expect? Then, have you checked the contents of `soup` before trying to find anything in it? These two checks tell you whether your `get` was successful, and whether the site is delivered fully in the HTML or loaded asynchronously when visited (the latter is likely with CNN), in which case you'll need a browser automation tool like Selenium – G. Anderson Mar 03 '20 at 19:44
  • Hi @G.Anderson! Thanks for your response. I am relatively new to web scraping, so I'm not sure what loading asynchronously means. Can you elaborate on that? – Sid Mar 03 '20 at 19:48
  • 1
    Might be worth a quick google, but the high-level overview: with techniques like Ajax (Asynchronous JavaScript and XML), much of the page content is loaded dynamically by JavaScript only when the page is visited by a web browser. This allows both customization of the user experience and, unfortunately for us, protection against things like web scraping. Check your `soup`, and I'd bet you'll see only a few HTML elements, since the rest of the page never actually loads unless a browser hits it. – G. Anderson Mar 03 '20 at 19:52
  • 1
    Does this answer your question? [BeautifulSoup4 doesn't find desired elements. What is the problem?](https://stackoverflow.com/questions/58197400/beautifulsoup4-doesnt-find-desired-elements-what-is-the-problem) – Pitto Mar 03 '20 at 19:52
  • If your matter is solved please mark an answer as accepted so that others can see that your question has been answered. – petezurich Mar 04 '20 at 07:18

1 Answer

Your code runs fine. It just doesn't yield results for the politics page.

Try this:

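import requests
from bs4 import BeautifulSoup

url = 'https://edition.cnn.com/'
topics = ['world', 'politics', 'business']

# Collect the headline text from every topic page
headlines = []

for topic in topics:
    # Fetch the raw HTML for this topic and parse it
    r = requests.get(url + topic)
    soup = BeautifulSoup(r.content, 'html.parser')

    # Pages without matching spans in the raw HTML simply add nothing
    for span in soup.find_all('span', {'class': 'cd__headline-text'}):
        headlines.append(span.text)
        print(span.text)
        print()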

Printing the headlines gives:

The bizarre ways that coronavirus is changing etiquette
Over half of all virus cases in one country are linked to this group
Trump's Middle East plan could jeopardize Jordan-Israel peace treaty, Jordan PM says
Irish duo's win marks rare victory for women in the 'Nobel of architecture'
After more than 240 days, Australia's New South Wales is finally free from bushfires
Child drowns off Greek coast after Turkey opens border with Europe 
A migration crisis and disagreement with Turkey is the last thing Europe needs right now
Vatican to open controversial WW2-era files on Pope Pius XII
Netanyahu projected to win Israeli election, but exit polls suggest bloc just short of majority
Adviser to Iran's Supreme Leader dies after contracting coronavirus
Israeli election exit polls project Netanyahu in lead
She became pregnant at the age of 12. Now, Kenya's Christine Ongare is an Olympic boxing qualifier
Nigeria says it is ready and more than capable of dealing with coronavirus
Kenya bans commercial slaughter of donkeys following a rise in animal theft 
Violence forces Haiti to cancel Carnival
....

You don't get results for politics because the content is rendered dynamically with JavaScript in the browser (as G. Anderson explained in his comments). With `requests`, however, you only get the raw HTML.

Open the site in the browser and compare "View page source" with "Inspect element". The former shows the raw HTML, the latter the rendered HTML.
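
The same comparison can be roughed out from code. The following is only a minimal sketch: it assumes the class name from the question is a good marker to look for, and `diagnose` and `diagnose_url` are hypothetical helper names, not part of `requests` or Beautiful Soup.

```python
def diagnose(status_code, html, marker="cd__headline-text"):
    """Give a short reason why a scrape may have come back empty."""
    if status_code != 200:
        return "request failed"
    if marker not in html:
        # The class never appears in the downloaded HTML, so the elements
        # are most likely added later by JavaScript in the browser.
        return "marker missing from raw HTML"
    return "marker present in raw HTML"


def diagnose_url(url, marker="cd__headline-text"):
    """Fetch url with requests and run the same check on the response."""
    import requests  # imported here so diagnose() can be used without it

    r = requests.get(url, timeout=10)
    return diagnose(r.status_code, r.text, marker)


# Example (needs network access):
# print(diagnose_url("https://edition.cnn.com/politics"))
```

If the marker is missing from the raw HTML, `find_all` has nothing to find, no matter how the selector is written.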

petezurich
  • Thank you, this does indeed solve the problem. One further clarification: is it better to use Selenium for scraping dynamically rendered content, or should I stick to Beautiful Soup? – Sid Mar 04 '20 at 10:45
  • 1
    You're welcome. Regarding your question: in my experience it very much depends. If I can scrape without Selenium, I usually do, because scraping with requests/BS4 is so much faster, even if I may have to invest a little more time in parsing. I only use Selenium if I cannot avoid it. At the same time, Selenium is very well maintained and documented, and works really well. I suggest you give both options a try. It's definitely worth it, even if only for the experience you gain. Good luck with your project(s)! – petezurich Mar 04 '20 at 12:45
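
For the Selenium route, a rough sketch might look like the following. It is an illustration under assumptions, not a tested recipe for CNN: `rendered_html` is a hypothetical helper name, and the usage lines assume Chrome and a matching driver are installed. The driver is passed in, so the helper itself does not import Selenium.

```python
def rendered_html(driver, url):
    """Load url in the given Selenium driver and return the page's HTML.

    The driver is always closed afterwards, even if loading fails.
    """
    try:
        driver.get(url)
        # page_source reflects the DOM at the moment it is read, so content
        # that arrives late may still be missing without an explicit wait.
        return driver.page_source
    finally:
        driver.quit()


# Example (needs selenium, Chrome and network access):
# from selenium import webdriver
# from bs4 import BeautifulSoup
# html = rendered_html(webdriver.Chrome(), "https://edition.cnn.com/politics")
# soup = BeautifulSoup(html, "html.parser")
# print([s.text for s in soup.find_all("span", {"class": "cd__headline-text"})])
```

The returned HTML can be handed to Beautiful Soup exactly as before, so only the fetching step changes.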