I'm making text files consisting of the author, date of publication and main text of news articles. I have code to do this, but I need for Newspaper3k
to identify the relevant information from these articles first. Since user agent specification has been an issue before, I also specify the user agent. Here's my code so you can follow along. This is version 3.9.0
of Python.
import time, os, random, nltk, newspaper
from newspaper import Article, Config
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
config = Config()
config.browser_user_agent = user_agent
url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'
article = Article(url, config=config)
article.download()
#article.html #
article.parse()
article.nlp()
article.authors
article.publish_date
article.text
To better understand why this case is particularly puzzling, please substitute the link I've provided above with this one, and re-run the code. With this link, the code now runs correctly, returning the author, date and text. With the link in the code above, it doesn't. What am I overlooking here?