
I am using Python 3 and the newspaper library. This library can create a Source object that is an abstraction of a news website. But what if I only need the abstraction of a certain category?

For example, with the URL below I want to get all of the articles in the 'technology' category. Instead, I get articles from 'politics'.

I think that when creating a Source object, newspaper uses only the domain name, which in my case is www.kyivpost.com.

Is there a way to make it work with URLs like http://www.kyivpost.com/technology/?
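Here is roughly what I am running now (a minimal sketch; memoize_articles=False just stops newspaper from skipping articles it has already seen on earlier runs):

import newspaper

# Build a Source from the category URL -- the articles that come back are
# mostly politics, as if only the domain part of the URL were used.
tech = newspaper.build('http://www.kyivpost.com/technology/', memoize_articles=False)

for article in tech.articles[:10]:
    print(article.url)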

  • Did you find a way to get the categories using the newspaper module? If so, can you please post the answer? – Henin RK Aug 10 '16 at 10:42
  • Newspaper cannot do this *out of the box.* You would have to wrap some additional code around newspaper to query this single category on the Kyiv Post's website. Plus, a lot of articles under this category require a subscription to access, which creates another issue. – Life is complex Mar 15 '21 at 13:51

2 Answers


newspaper will use a site's RSS feed when available; KyivPost only has one RSS feed and posts articles mainly on politics, which is why your result set is mostly politics.

You may have more luck using BeautifulSoup to pull article URLs specifically from the technology page and feed them to newspaper directly.
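A rough sketch of that approach (untested; the '/technology/' filter on the hrefs is an assumption, so adjust it to the page's actual link structure):

import requests
from bs4 import BeautifulSoup
from newspaper import Article

# Grab candidate article links from the technology section page.
page = requests.get('https://www.kyivpost.com/technology')
soup = BeautifulSoup(page.text, 'html.parser')

urls = {a['href'] for a in soup.find_all('a', href=True)
        if '/technology/' in a['href']}

# Hand each URL to newspaper directly instead of building a Source.
for url in urls:
    article = Article(url)
    article.download()
    article.parse()
    print(article.title)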


I know this is a bit old, but if someone is still looking for something like this: first get all the anchor-tag elements, filter the links with a regex, and then request each link to pull the article plus the required data. Below is a sample; change the soup selectors to match your page.

"""
Created on Tue Jan 21 10:10:02 2020

@author: prakh
"""

import requests
#import csv
from bs4 import BeautifulSoup
import re
from functools import partial  
from operator import is_not
from dateutil import parser
import pandas as pd
from datetime import timedelta, date

final_url = 'https://www.kyivpost.com/technology'

links = []
news_data = []
filter_null = partial(filter, partial(is_not, None))

try:
    page = requests.get(final_url)

    soup = BeautifulSoup(page.text, 'html.parser')

    # The article listing on the technology page sits inside this container.
    last_links = soup.find(class_='filter-results-archive')

    # Collect every link in the listing, skipping anchors without an href.
    for anchor in last_links.find_all('a'):
        href = anchor.get('href')
        if href:
            links.append(href)

    # Keep only the URLs that belong to the technology section
    # (re.search rather than re.match, since the URLs do not start with 'technology').
    regex = re.compile(r'technology')
    selected_files = [link for link in links if regex.search(link)]
except Exception as e:
    print(e)
    print("continuing....")

for url in selected_files:
    news_category = url.split('/')[-2]
    try:
        data = requests.get(url)
        soup = BeautifulSoup(data.content, 'html.parser')

        # The article body is the set of <p> tags inside the printable area.
        last_links2 = soup.find(id='printableAreaContent')
        last_links3 = last_links2.find_all('p')

        news_articles = [{
            'news_headline': soup.find('h1', attrs={'class': 'post-title'}).string,
            'news_article': ' '.join(p.get_text(strip=True) for p in last_links3),
            'news_category': news_category,
        }]

        news_data.extend(news_articles)
    except Exception as e:
        print(e)
        print("continuing....")
        continue

df = pd.DataFrame(news_data)
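If you want to keep the results, writing the frame out with pandas is a natural next step (the filename is just an example):

df.to_csv('kyivpost_technology.csv', index=False)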