Newspaper3k, User Agents and Scraping

Question

I'm building text files that contain the author, publication date and main text of news articles. I have code to write the files, but first I need Newspaper3k to extract that information from each article. Since the user agent has caused problems before, I also specify one. Here's my code so you can follow along. I'm using Python 3.9.0.

import time, os, random, nltk, newspaper 

from newspaper import Article, Config

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'

config = Config()
config.browser_user_agent = user_agent

url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'
article = Article(url, config=config)
article.download()
#article.html #
article.parse()
article.nlp()

article.authors
article.publish_date
article.text

To see why this case is particularly puzzling, substitute the link above with this one and re-run the code. With that link, the code runs correctly and returns the author, date and text. With the link in the code above, it doesn't. What am I overlooking?

LMK if you still need help with this question. — Life is complex, Jul 27 '21 at 20:23

score 0 · Accepted Answer · answered Jul 20 '21 at 14:01

Apparently, Newspaper needs us to specify the article's language. The code below still doesn't extract the author for some reason, but this is enough for my purposes. Here's the code, in case anyone else can benefit from it.


#
# Imports our modules
#

import time, os, random, nltk, newspaper
from newspaper import Article
from googletrans import Translator
translator = Translator()

# The link we're interested in

url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'


#
# Extracts the meta-data
#

article = Article(url, language='es')
article.download()
article.parse()
article.nlp()

#
# Makes these into strings so they'll get into the list
#

authors = str(article.authors)
date = str(article.publish_date)
maintext = translator.translate(article.summary).text


# Makes the list we'll append

elements = [authors + "\n", date + "\n", maintext + "\n", url]

for x in elements:
    print(x)

The author tag is contained in a script tag, which requires additional code to extract. I published a [Newspaper Usage Document](https://github.com/johnbumgarner/newspaper3_usage_overview) on GitHub that discusses various collection strategies and other topics surrounding this library. — Life is complex, May 23 '22 at 21:00
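
To illustrate the kind of additional code the comment above refers to, here is a minimal sketch that pulls author names out of script tags. It assumes the page embeds its metadata as JSON-LD (`<script type="application/ld+json">`) with an `author` field, which may not match how this particular site does it. The helper name `authors_from_json_ld` and the sample HTML (including the name "Jane Doe") are made up for illustration; only the standard library is used.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        # Script content can arrive in several chunks, so concatenate them.
        if self._inside:
            self.blocks[-1] += data


def authors_from_json_ld(html):
    """Hypothetical helper: return author names found in JSON-LD blocks."""
    collector = JsonLdCollector()
    collector.feed(html)
    names = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON rather than fail
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            author = item.get("author")
            entries = author if isinstance(author, list) else [author]
            for entry in entries:
                # Assumption: an author is either {"name": ...} or a plain string.
                if isinstance(entry, dict) and entry.get("name"):
                    names.append(entry["name"])
                elif isinstance(entry, str) and entry:
                    names.append(entry)
    return names


# Made-up sample page, not taken from the article in the question.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@type": "NewsArticle", "author": {"@type": "Person", "name": "Jane Doe"}}
</script>
</head><body></body></html>
"""

print(authors_from_json_ld(sample_html))
```

With Newspaper3k, the raw page is available as `article.html` after `article.download()`, so `authors_from_json_ld(article.html)` could serve as a fallback when `article.authors` comes back empty.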
