Scraping the news titles from news websites

Question

I've been trying to scrape news titles from news websites. I've come across two Python libraries for this: newspaper and beautifulsoup4. Using Beautiful Soup, I can get all the links on a particular news website that lead to news articles. With the code below, I can extract the title of a news article from a single link.

from newspaper import Article
url = "https://www.ndtv.com/india-news/tamil-nadu-government-reverses-decision-to-reopen-schools-from-november-16-for-classes-9-12-news-agency-pti-2324199"
article = Article(url)
article.download()
article.parse()
print(article.title)

I want to combine the code from both libraries, newspaper and beautifulsoup4, so that every link I get from Beautiful Soup is passed as the url to newspaper and I get the titles of all the links. Below is the Beautiful Soup code I use to extract all the links to the news articles.

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("https://www.ndtv.com/coronavirus?pfrom=home-mainnavgation")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

for link in soup.find_all('a', href=True):
    print(link['href'])

I recently started putting together a detailed Newspaper3k usage document that I shared publicly. It is available here: https://github.com/johnbumgarner/newspaper3_usage_overview. You might find it useful when using Newspaper. – Life is complex, Nov 22 '20 at 17:30
@Lifeiscomplex Hello, can you help me a little with this library? I do not know what I am doing wrong: https://stackoverflow.com/questions/65110807/newspaper3k-lib-for-article-parsing-does-not-return-data – taga, Dec 02 '20 at 19:57

score 1 · Accepted Answer · answered Nov 20 '20 at 11:20

Do you mean something like this?

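from newspaper import Article

# Collect every href on the page (uses `soup` from the question's code)
links = [link['href'] for link in soup.find_all('a', href=True)]

# Download and parse each link, then print its title
for link in links:
    article = Article(link)
    article.download()
    article.parse()
    print(article.title)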
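
The loop above passes every href on the page to Article, including navigation links, relative paths, and non-HTTP links. One possible refinement is to filter the hrefs first. The sketch below is only a heuristic, not part of either library: the helper name `filter_article_links` and the `min_slug_hyphens` threshold are made up for illustration, and the rule "an article URL ends in a long hyphenated slug" is an assumption based on the NDTV URL in the question. It may need adjusting for other sites.

```python
from urllib.parse import urljoin, urlparse


def filter_article_links(hrefs, base_url, min_slug_hyphens=3):
    """Return absolute, de-duplicated, same-site URLs that look like articles.

    Heuristic (an assumption, not a rule of any library): the last path
    segment of an article URL is a slug with at least `min_slug_hyphens`
    hyphens, while section pages such as /coronavirus have short slugs.
    """
    base_host = urlparse(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        # Resolve relative links such as "/india-news/..." against the page URL
        absolute = urljoin(base_url, href.strip())
        parts = urlparse(absolute)
        # Skip javascript:, mailto:, and other non-HTTP links
        if parts.scheme not in ("http", "https"):
            continue
        # Skip links that leave the site
        if parts.netloc != base_host:
            continue
        # Take the last path segment and apply the slug heuristic
        slug = parts.path.rstrip("/").rsplit("/", 1)[-1]
        if slug.count("-") < min_slug_hyphens:
            continue
        # Drop query string and fragment so duplicates collapse to one URL
        clean = parts._replace(query="", fragment="").geturl()
        if clean not in seen:
            seen.add(clean)
            result.append(clean)
    return result
```

With the `soup` object from the question, the input could be `[a['href'] for a in soup.find_all('a', href=True)]` and the base URL the page that was requested. Each returned URL can then go through the same Article download and parse loop as above.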

Abhishek Rai

Combining the code in the question with this answer printed page titles such as Main Page, Celebrities, Spor News, etc. People who come to this question are probably looking for the titles of news articles rather than these. – Güray Hatipoğlu Aug 27 '23 at 07:35
