I am trying to parse articles from a set of links generated by the Python library Newspaper.
Goal:
To parse every link from the main page (or a specific page, such as a category page) of a news site.
Problems:
- I get an AttributeError when passing 'article_link' to the 'Article()' constructor.
- Using separate code to parse a single link from 'The New York Times', the printed text does not contain the whole article.
Code Producing Problem 1:
import newspaper
from newspaper import Article

nyt_paper = newspaper.build(
    'http://nytimes.com/section/todayspaper', memoize_articles=False)
print(nyt_paper.size())

processed_link_list = []
for article_link in nyt_paper.articles:
    article = Article(url=article_link)  # the AttributeError in the traceback is raised here
    article.download()
    article.html
    article.parse()
    print(article.authors)
    processed_link_list.append(article_link)

# size() already returns an int, so compare it directly with ==
if nyt_paper.size() == len(processed_link_list):
    print('All Links Processed')
else:
    print('All Links **NOT** Processed')
Error Output:
Traceback (most recent call last):
  File "nyt_today.py", line 31, in <module>
    article = Article(url=article_link)
  File "C:\...\lib\site-packages\newspaper\article.py", line 60, in __init__
    scheme = urls.get_scheme(url)
  File "C:\...\lib\site-packages\newspaper\urls.py", line 279, in get_scheme
    return urlparse(abs_url, **kwargs).scheme
  File "C:\...\lib\urllib\parse.py", line 367, in urlparse
    url, scheme, _coerce_result = _coerce_args(url, scheme)
  File "C:\...\lib\urllib\parse.py", line 123, in _coerce_args
    return _decode_args(args) + (_encode_result,)
  File "C:\...\lib\urllib\parse.py", line 107, in _decode_args
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
  File "C:\...\lib\urllib\parse.py", line 107, in <genexpr>
    return tuple(x.decode(encoding, errors) if x else '' for x in args)
AttributeError: 'Article' object has no attribute 'decode'
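One reading of this traceback is that the items in 'nyt_paper.articles' are already 'Article' objects rather than URL strings, so 'urlparse' ends up calling '.decode()' on an object that has no such method. The sketch below reproduces that failure mode with the standard library only. 'FakeArticle' is a hypothetical stand-in for newspaper's 'Article' class, not the real one, and the sample URL is just the one from the code above; treat it as an illustration of the suspected cause, not a confirmed diagnosis.

```python
from urllib.parse import urlparse


class FakeArticle:
    """Hypothetical stand-in for newspaper's Article; it only stores a URL."""

    def __init__(self, url):
        self.url = url


item = FakeArticle('http://nytimes.com/section/todayspaper')

# Passing the object itself to urlparse fails, because urlparse treats any
# non-str argument as bytes-like and tries to call .decode() on it.
error_message = None
try:
    urlparse(item)
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# Passing the URL string stored on the object works as expected.
scheme = urlparse(item.url).scheme
print(scheme)
```

If that reading is right, either passing the string (for example 'Article(url=article_link.url)') or calling 'download()' and 'parse()' directly on each item from 'nyt_paper.articles' might avoid the error; I have not confirmed this.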
Code Producing Problem 2:
from newspaper import Article
from newspaper import fulltext
import requests
nyt_url = 'https://www.nytimes.com/2019/02/26/opinion/trump-kim-vietnam.html'
article = Article(nyt_url)
article.download()
print(article.html)
article.parse()
print(article.authors)
print(article.text)
I have also tried the 'fulltext' function shown in the documentation to print the text:
article_html = requests.get(nyt_url).text
full_text = fulltext(article_html)
print(full_text)
However, although the entire article text appears in the output of
print(article.html)
the call
print(article.text)
does not print all of it. The original link, HTML output and printed text output can be seen below:
Link: https://www.nytimes.com/2019/02/26/opinion/trump-kim-vietnam.html
HTML output: see this pastebin for the truncated output
Printed text: see this printed text, which does not contain the entire article
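To put a number on how much of the article is missing, the downloaded HTML can be compared against the extracted text. The following is a rough sketch using only the standard library: 'ParagraphText' and 'coverage' are hypothetical helper names, the sample strings are made up and stand in for 'article.html' and 'article.text', and counting words inside <p> tags is only an approximation of the real article body.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collects the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <p> tags currently open
        self.chunks = []    # text fragments seen inside <p> tags

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'p' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def coverage(html, extracted_text):
    """Fraction of words inside <p> tags that also occur in extracted_text."""
    parser = ParagraphText()
    parser.feed(html)
    html_words = ' '.join(parser.chunks).split()
    if not html_words:
        return 0.0
    extracted_words = set(extracted_text.split())
    found = sum(1 for word in html_words if word in extracted_words)
    return found / len(html_words)


# Made-up sample data standing in for article.html and article.text.
sample_html = ('<html><body><p>First paragraph here.</p>'
               '<p>Second one missing.</p></body></html>')
sample_text = 'First paragraph here.'
ratio = coverage(sample_html, sample_text)
print(ratio)
```

With the real data, calling 'coverage(article.html, article.text)' after 'article.parse()' would give a rough indication of how much paragraph text the extractor dropped.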
Any help would be much appreciated.