
I am trying to parse articles from a set of links generated by the Python library Newspaper.

Goal:

To parse every link from the main page (or a specific page, such as a category) of a news site.

Problem:

  1. An AttributeError is raised when attempting to pass an 'article_link' into the 'Article()' constructor.
  2. Using separate code to parse a single link from 'The New York Times', the printed text does not include the whole article.

Code Producing Problem 1:

import newspaper
from newspaper import Article

nyt_paper = newspaper.build(
    'http://nytimes.com/section/todayspaper', memoize_articles=False)
print(nyt_paper.size())

processed_link_list = []
for article_link in nyt_paper.articles:
    article = Article(url=article_link)
    article.download()
    article.html
    article.parse()
    print(article.authors)
    processed_link_list.append(article_link)

if nyt_paper.size() == len(processed_link_list):
    print('All Links Processed')
else:
    print('All Links **NOT** Processed')

Error Output:

Traceback (most recent call last):
  File "nyt_today.py", line 31, in <module>
    article = Article(url=article_link)
  File "C:\...\lib\site-packages\newspaper\article.py", line 60, in __init__
    scheme = urls.get_scheme(url)
  File "C:\...\lib\site-packages\newspaper\urls.py", line 279, in get_scheme
    return urlparse(abs_url, **kwargs).scheme
  File "C:\...\lib\urllib\parse.py", line 367, in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "C:\...\lib\urllib\parse.py", line 123, in _coerce_args
    return _decode_args(args) + (_encode_result,)
  File "C:\...\lib\urllib\parse.py", line 107, in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
  File "C:\...\lib\urllib\parse.py", line 107, in <genexpr>
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
AttributeError: 'Article' object has no attribute 'decode'
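For reference, the traceback suggests that nyt_paper.articles yields Article objects rather than URL strings, so urlparse receives an object it cannot decode. The stand-in class below is hypothetical (it only mimics an Article holding a url attribute), but it reproduces the same AttributeError and shows the likely fix: pass the object's .url string instead of the object itself.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for newspaper.Article: just holds a url attribute.
class FakeArticle:
    def __init__(self, url):
        self.url = url

link = FakeArticle('https://www.nytimes.com/section/todayspaper')

# Passing the object itself reproduces the AttributeError from the traceback:
try:
    urlparse(link)
except AttributeError as exc:
    print(exc)  # 'FakeArticle' object has no attribute 'decode'

# Passing the .url string works as urlparse expects:
print(urlparse(link.url).scheme)
```

In the loop above, the equivalent change would be Article(url=article_link.url), or simply calling article_link.download() and article_link.parse() on the objects the build already created.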

Code Producing Problem 2:

from newspaper import Article
from newspaper import fulltext
import requests

nyt_url = 'https://www.nytimes.com/2019/02/26/opinion/trump-kim-vietnam.html'
article = Article(nyt_url)
article.download()
print(article.html)
article.parse()
print(article.authors)
print(article.text)

I have also tried the 'fulltext' method shown in the documentation to print the text:

article_html = requests.get(nyt_url).text
full_text = fulltext(article_html)
print(full_text)

However, although the entire article text is output by

print(article.html)

the

print(article.text)

call does not print all of it. The original link, HTML output and printed text output can be seen below:

Link: https://www.nytimes.com/2019/02/26/opinion/trump-kim-vietnam.html

HTML output: see this pastebin for the (truncated) output

Printed text: see this pastebin; the printed text does not include the entire article

Any help would be much appreciated.

R.Zane
    Hello. Were you able to parse all the link from the main page at last ? I am trying to do the same, in python 3 – Proteeti Prova Sep 22 '19 at 06:40
  • Hey, No. I never got it to work. Shame. It is possible that it is a blacklist problem. Consider using Proxy/Useragent? see: [This Stackoverflow Link](https://stackoverflow.com/questions/56678732/how-to-fix-newspaper3k-403-client-error-for-certain-urls) Let me know if you get it to work...would like to get it to work. Maybe we can troubleshoot together...but I am green/newbie? – R.Zane Sep 23 '19 at 23:15

1 Answer


NYTimes has changed its internal HTML structure since 2014. Newspaper3k will work fine if you try to parse articles published before 2014.

Other things to take into account:

  • 1980 articles are not available.
  • Articles before 1970 are not digitized (except 1964).
  • 1970-1979 articles have lots of words split in the middle by a space.
  • If you parse with Newspaper3k, several articles will contain only "NYTimes.com no longer supports Internet Explorer 9 or earlier. Please upgrade your browser."
  • Lots of articles will have the following texts inserted in the middle:

"\n\nNewsletter Sign Up Continue reading the main story Sign Up for the Opinion Today Newsletter Every weekday, get thought-provoking commentary from Op-Ed columnists, the Times editorial board and contributing writers from around the world. Please verify you're not a robot by clicking the box. Invalid email address. Please re-enter. You must select a newsletter to subscribe to. Sign Up You will receive emails containing news content , updates and promotions from The New York Times. You may opt-out at any time. You agree to receive occasional updates and special offers for The New York Times's products and services. Thank you for subscribing. An error has occurred. Please try again later. View all New York Times newsletters.\n\n"

"\n\nNewsletter Sign Up Continue reading the main story Please verify you're not a robot by clicking the box. Invalid email address. Please re-enter. You must select a newsletter to subscribe to. Sign Up You will receive emails containing news content , updates and promotions from The New York Times. You may opt-out at any time. You agree to receive occasional updates and special offers for The New York Times's products and services. Thank you for subscribing. An error has occurred. Please try again later. View all New York Times newsletters.\n"

  • Most blogs (blogs first appear in 2010) will also have undesired text inserted.
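If you still want to use Newspaper3k output, one workaround is to strip the known boilerplate after parsing. This is a minimal sketch: the marker strings are taken from the fragments quoted above and may need extending, and dropping whole paragraphs that contain a marker is an assumption about how the junk is embedded.

```python
# Known junk fragments that Newspaper3k leaves inside article.text
# (markers copied from the fragments quoted above; extend as needed).
BOILERPLATE_MARKERS = [
    "Newsletter Sign Up Continue reading the main story",
    "NYTimes.com no longer supports Internet Explorer 9 or earlier",
]

def strip_boilerplate(text):
    """Drop any paragraph that contains a known boilerplate marker."""
    paragraphs = text.split("\n\n")
    kept = [p for p in paragraphs
            if not any(marker in p for marker in BOILERPLATE_MARKERS)]
    return "\n\n".join(kept)

sample = ("Real paragraph one.\n\n"
          "Newsletter Sign Up Continue reading the main story Sign Up "
          "for the Opinion Today Newsletter ...\n\n"
          "Real paragraph two.")
print(strip_boilerplate(sample))
```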

If you are OK with data from 1990 to 2016, check the dataset used in this paper: https://arxiv.org/abs/1703.00607; it's available online.

In case you need newer articles, I think you should write your own parser. I'm working on one but haven't finished it yet.
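A do-it-yourself parser can be quite small. The sketch below uses only the standard library and assumes the article body is the set of p tags inside an article element; that structure is an assumption, so inspect the live NYT markup (or any other site's) before relying on it. It is demonstrated on an inline HTML snippet rather than a network fetch.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text of every <p> that sits inside an <article> element."""

    def __init__(self):
        super().__init__()
        self.in_article = False
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'article':
            self.in_article = True
        elif tag == 'p' and self.in_article:
            self.in_p = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'article':
            self.in_article = False
        elif tag == 'p':
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data

# Inline sample standing in for a fetched page.
html_doc = """
<html><body>
  <article>
    <p>First paragraph of the story.</p>
    <p>Second paragraph of the story.</p>
  </article>
  <p>Footer text outside the article.</p>
</body></html>
"""

parser = ParagraphExtractor()
parser.feed(html_doc)
print('\n\n'.join(parser.paragraphs))
```

For real pages you would feed the parser the body of a requests.get(url).text call; a third-party library like BeautifulSoup or lxml makes the extraction less brittle than hand-rolled state flags.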