I have installed the newspaper3k library on my Mac with `sudo pip3 install newspaper3k`. I'm using Python 3. I want to retrieve the data the Article object supports: URL, date, title, text, summary, and keywords, but I do not get any data:

import newspaper
from newspaper import Article

# building a news source object for scraping
cnn_paper = newspaper.build('https://www.euronews.com/', memoize_articles=False)

#I have tried for https://www.euronews.com/, https://edition.cnn.com/, https://www.bbc.com/


for article in cnn_paper.articles:
    article_url = article.url  # works
    news_article = Article(article_url)  # works

    print("OBJECT:", news_article, '\n')  # works
    print("URL:", article_url, '\n')  # works
    print("DATE:", news_article.publish_date, '\n')  # does not work
    print("TITLE:", news_article.title, '\n')  # does not work
    print("TEXT:", news_article.text, '\n')  # does not work
    print("SUMMARY:", news_article.summary, '\n')  # does not work
    print("KEYWORDS:", news_article.keywords, '\n')  # does not work
    print()
    input()

I get the Article object and the URL, but everything else is ''. I have tried different websites, but the result is the same.

Then I tried to add:

    news_article.download()
    news_article.parse()
    news_article.nlp()
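so that the loop looked roughly like this:

for article in cnn_paper.articles:
    news_article = Article(article.url)
    news_article.download()  # fetch the HTML
    news_article.parse()     # fill in title, text, publish_date
    news_article.nlp()       # fill in summary and keywords (must run after parse)

    print("DATE:", news_article.publish_date)
    print("TITLE:", news_article.title)
    print("TEXT:", news_article.text)
    print("SUMMARY:", news_article.summary)
    print("KEYWORDS:", news_article.keywords)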

I have also tried to set a Config with custom headers and timeouts, but the results are the same.

When I do that, for each website I get only 16 articles with date, title, and body values. That is very strange to me: for each website I'm getting the same amount of data, but for more than 95% of the news articles I'm getting None.

Can Beautiful Soup help me?

Can someone help me understand what the problem is, why I'm getting so many Null/NaN/'' values, and how I can fix it?

These are the docs for the library:

https://newspaper.readthedocs.io/en/latest/


1 Answer

I would recommend that you review the newspaper overview document that I published on GitHub. The document has multiple extraction examples and other techniques that might be useful.

Concerning your question...

Newspaper3K will parse certain websites nearly flawlessly. But there are plenty of websites that will require reviewing a page's navigational structure to determine how to parse the article elements correctly.
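A quick way to review that structure is to parse one article and dump its meta_data dictionary, which holds everything newspaper found in the page's meta tags. A minimal sketch (the URL is only a placeholder; substitute one from the site you are scraping):

from pprint import pprint
from newspaper import Article

# hypothetical article URL -- replace with a real one from the target site
url = 'https://www.marketwatch.com/story/example-article'

article = Article(url)
article.download()
article.parse()

# meta_data holds every <meta> tag newspaper found on the page;
# inspect the keys to decide which elements to extract
pprint(dict(article.meta_data))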

For instance, https://www.marketwatch.com has individual article elements, such as title, publish date, and other items, stored within the meta tag section of the page.

The newspaper example below parses these elements correctly. Note that you might need to do some data cleaning of the keyword or tag output.

import newspaper
from newspaper import Config
from newspaper import Article

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.marketwatch.com'
article_urls = set()
marketwatch = newspaper.build(base_url, config=config, memoize_articles=False, language='en')
for sub_article in marketwatch.articles:
    article = Article(sub_article.url, config=config, memoize_articles=False, language='en')
    article.download()
    article.parse()
    if article.url not in article_urls:
        article_urls.add(article.url)

        # The majority of the article elements are located
        # within the meta data section of the page's
        # navigational structure
        article_meta_data = article.meta_data

        published_date = {value for (key, value) in article_meta_data.items() if key == 'parsely-pub-date'}
        article_published_date = " ".join(str(x) for x in published_date)

        authors = sorted({value for (key, value) in article_meta_data.items() if key == 'parsely-author'})
        article_author = ', '.join(authors)

        title = {value for (key, value) in article_meta_data.items() if key == 'parsely-title'}
        article_title = " ".join(str(x) for x in title)

        keywords = ''.join({value for (key, value) in article_meta_data.items() if key == 'keywords'})
        keywords_list = sorted(keywords.lower().split(','))
        article_keywords = ', '.join(keywords_list)

        tags = ''.join({value for (key, value) in article_meta_data.items() if key == 'parsely-tags'})
        tag_list = sorted(tags.lower().split(','))
        article_tags = ', '.join(tag_list)

        summary = {value for (key, value) in article_meta_data.items() if key == 'description'}
        article_summary = " ".join(str(x) for x in summary)

        # the replace is used to remove newlines
        article_text = article.text.replace('\n', '')
        print(article_text)
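For the keyword cleaning mentioned above, something along these lines strips stray whitespace and empty entries (a sketch with made-up input, not tied to any particular site):

# example raw keyword string as it often comes out of the meta tags
raw_keywords = 'Stocks, , Markets ,  Earnings'

# lowercase, split on commas, trim whitespace, drop empties, dedupe, sort
cleaned = sorted({kw.strip() for kw in raw_keywords.lower().split(',') if kw.strip()})
article_keywords = ', '.join(cleaned)

print(article_keywords)  # earnings, markets, stocks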

https://www.euronews.com is similar to https://www.marketwatch.com, except some of the article elements are located in the main body and other items are within the meta tag section.

import newspaper
from newspaper import Config
from newspaper import Article

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.euronews.com'
article_urls = set()
euronews = newspaper.build(base_url, config=config, memoize_articles=False, language='en')
for sub_article in euronews.articles:
    if sub_article.url not in article_urls:
        article_urls.add(sub_article.url)
        article = Article(sub_article.url, config=config, memoize_articles=False, language='en')
        article.download()
        article.parse()

        # The majority of the article elements are located
        # within the meta data section of the page's
        # navigational structure
        article_meta_data = article.meta_data

        published_date = {value for (key, value) in article_meta_data.items() if key == 'date.created'}
        article_published_date = " ".join(str(x) for x in published_date)

        article_title = article.title

        summary = {value for (key, value) in article_meta_data.items() if key == 'description'}
        article_summary = " ".join(str(x) for x in summary)

        keywords = ''.join({value for (key, value) in article_meta_data.items() if key == 'keywords'})
        keywords_list = sorted(keywords.lower().split(','))
        article_keywords = ', '.join(keywords_list).strip()

        # the replace is used to remove newlines
        article_text = article.text.replace('\n', '')
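
Separately from the meta tags, newspaper exposes a few built-in attributes for these items; they are hit-or-miss depending on the site, but worth checking before writing custom meta tag extraction. A brief sketch (the URL is only a placeholder; output varies by source):

from newspaper import Article

# hypothetical article URL -- replace with a real one
article = Article('https://www.euronews.com/example-article')
article.download()
article.parse()

# these fields are populated during parse() when the page
# exposes them in a form newspaper recognizes
print(article.authors)        # list of author names, often empty
print(article.meta_keywords)  # keywords from the page's meta tags
print(article.meta_lang)      # two-letter language code detected from the page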
  • Thanks, are there better chances of getting article values if I pass HTML from Beautiful Soup to Article? And what are the request limitations; is there a max number of requests that I can post? – taga Dec 03 '20 at 07:32
  • Also, how can I extract meta_keywords and author names, and is there a way to identify the language of the news (for example, if I start collecting all news on a website, and the website has news in English, French, etc., can I select only news in English)? – taga Dec 03 '20 at 07:55
  • @taga I noted when I queried euronews that I got multiple languages with newspaper.build, but I believe that changed in the next round of using Article(). In some cases you might need to use selenium to 'click' something on the website to get the desired language. – Life is complex Dec 03 '20 at 13:14
  • @taga my examples above show how to extract meta items. Some items (e.g., keywords, author) might exist in the meta tags; sometimes these won't exist and require additional code to scrape. My GitHub provides lots of examples (e.g., Beautiful Soup), so please review that document. – Life is complex Dec 03 '20 at 13:21
  • @taga and concerning the "request limitations" for a source: that is something that you will have to gauge yourself, because any source can drop your connection if you hit it too hard. I normally add a time.sleep call in my code when that starts happening. – Life is complex Dec 03 '20 at 13:26
  • What does article.download() do? And where does it download? – taga Dec 04 '20 at 01:35
  • it downloads the article text to a temp file, so it can be processed. – Life is complex Dec 04 '20 at 02:54
  • but where is that stored? – taga Dec 04 '20 at 10:08
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/225511/discussion-between-taga-and-life-is-complex). – taga Dec 04 '20 at 10:08
  • Is there another way to get the article text? I saw that you are getting the text with article_text = article.text, but is there another way? How are you recognizing what is article text/body and what is not? – taga Mar 09 '21 at 14:56
  • @taga Newspaper is programmed to extract the article text based on predefined parameters (e.g. tag types). Sometimes Newspaper will not harvest the article text. In these rare situations you have to use additional code (e.g. bs4) to support the extraction. – Life is complex Mar 09 '21 at 20:36
  • Hello, one question: is there a way to get a list of all websites that can be crawled with this lib? Is there any list that provides the domains of websites that work well with this lib? – taga Jul 18 '21 at 16:41
  • @taga unfortunately, there is no list. – Life is complex Jul 18 '21 at 17:00