10

Suppose I have local copies of news articles. How can I run newspaper on those articles? According to the documentation, the normal use of the newspaper library looks something like this:

from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.parse()
# ...

In my case, I do not need to download the article from a web page because I already have a local copy of the page. How can I use newspaper on a local copy of the web page?

Flux

3 Answers

10

There is indeed an official way to do this, as mentioned here.

Once you've loaded your HTML into the program, you can use the set_html() method to assign it to article.html:

import newspaper
with open("file.html", 'rb') as fh:
    ht = fh.read()
article = newspaper.Article(url = ' ')
article.set_html(ht)
article.parse()
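
If there are many saved pages, the same set_html() call can be wrapped in a small helper. The following is a minimal sketch, not part of newspaper itself: read_local_pages and parse_local_page are hypothetical names, the flat directory of *.html files is an assumption, and newspaper is imported lazily so that the file-reading part works on its own.

```python
from pathlib import Path


def read_local_pages(directory, pattern="*.html"):
    """Yield (path, raw_bytes) for each saved page in `directory`, sorted by name."""
    for path in sorted(Path(directory).glob(pattern)):
        yield path, path.read_bytes()


def parse_local_page(raw_html, source_url=""):
    """Parse already-downloaded HTML with newspaper instead of fetching it."""
    # Imported here so the helper above has no third-party dependency
    from newspaper import Article

    article = Article(url=source_url)
    article.set_html(raw_html)  # hand over the local HTML, as in the answer above
    article.parse()
    return article


# Possible usage, assuming a folder named "saved_pages" exists:
# for path, raw in read_local_pages("saved_pages"):
#     article = parse_local_page(raw)
#     print(path.name, article.title)
```

Keeping the file reading separate from the parsing makes it easy to swap in another source of HTML later, such as a database or an archive.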
LucyDrops
6

You can; it's just a bit hacky. As an example:

import requests
from newspaper import Article

url = 'https://www.cnn.com/2019/06/19/india/chennai-water-crisis-intl-hnk/index.html'

# get sample html
r = requests.get(url)

# save to file
with open('file.html', 'wb') as fh:
    fh.write(r.content)

a = Article(url)

# set html manually
with open("file.html", 'rb') as fh:
    a.html = fh.read()

# need to set download_state to 2 for this to work
a.download_state = 2

a.parse()

# Now the article should be populated
a.text

# 'New Delhi (CNN) The floor...'

The download_state values come from this snippet in newspaper/article.py:

# /path/to/site-packages/newspaper/article.py
class ArticleDownloadState(object):
    NOT_STARTED = 0
    FAILED_RESPONSE = 1
    SUCCESS = 2

~snip~

# This is why you need to set that variable
class Article:
    def __init__(...):
        ~snip~
        # Keep state for downloads and parsing
        self.is_parsed = False
        self.download_state = ArticleDownloadState.NOT_STARTED
        self.download_exception_msg = None

    def parse(self):
        # will throw exception if download_state isn't 2
        self.throw_if_not_downloaded_verbose()

        self.doc = self.config.get_parser().fromstring(self.html)
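
The effect of that guard can be illustrated with a small stand-in class. This is a simplified model for illustration only, not newspaper's actual code: ToyArticle and its error message are hypothetical, and only the three state values are taken from the snippet above.

```python
class ArticleDownloadState:
    # Values copied from the snippet above
    NOT_STARTED = 0
    FAILED_RESPONSE = 1
    SUCCESS = 2


class ToyArticle:
    """Hypothetical stand-in that mimics only the download-state guard."""

    def __init__(self):
        self.html = ""
        self.is_parsed = False
        self.download_state = ArticleDownloadState.NOT_STARTED

    def parse(self):
        # Refuse to parse unless the state says the HTML is available
        if self.download_state != ArticleDownloadState.SUCCESS:
            raise RuntimeError("HTML not marked as downloaded")
        self.is_parsed = True


toy = ToyArticle()
toy.html = "<html><body>local copy</body></html>"

# Setting html alone is not enough: the state is still NOT_STARTED
try:
    toy.parse()
except RuntimeError as exc:
    print("parse refused:", exc)

# Marking the download as successful lets parse() proceed
toy.download_state = ArticleDownloadState.SUCCESS
toy.parse()
print(toy.is_parsed)
```

Using the named constant rather than the bare number 2 also keeps the intent readable if the values ever change.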

As an alternative, you could subclass Article and override parse() so that it loads the local file itself and otherwise behaves the same:

from newspaper import Article
import io

class localArticle(Article):
    def __init__(self, url, **kwargs):
        # set url to be file_name in __init__ if it's a file handle
        super().__init__(url if isinstance(url, str) else url.name, **kwargs)
        # set standalone _url attr so that parse will work as expected
        self._url = url

    def parse(self):

        # sets html and things for you
        if isinstance(self._url, str):
            with open(self._url, 'rb') as fh:
                self.html = fh.read()

        elif isinstance(self._url, (io.TextIOWrapper, io.BufferedReader)):
            self.html = self._url.read()

        else:
            raise TypeError(f"Expected file path or file-like object, got {self._url.__class__}")

        self.download_state = 2
        # now parse will continue on with the proper params set
        super(localArticle, self).parse()


a = localArticle('file.html') # pass your file name here
a.parse()

a.text[:10]
# 'New Delhi '

# or you can give it a file handle
with open("file.html", 'rb') as fh:
    a = localArticle(fh)
    a.parse()

a.text[:10]
# 'New Delhi '
C.Nivs
  • I take it that this is not officially supported or documented? – Flux Jun 20 '19 at 02:39
  • 1
    Doesn't look like it out of the box, at least. I haven't used this package in about a year and a half, and certainly not on local copies of html pages, so it could be in the documentation and I just missed it – C.Nivs Jun 20 '19 at 03:20
  • @Flux after re-reading the docs, no, this is not officially documented – C.Nivs Jun 20 '19 at 03:30
  • @Flux made an edit to the `localArticle` class so that parse behaves the same as the parent api – C.Nivs Jun 20 '19 at 13:28
4

I'm sure that you have solved this by now, but Newspaper is able to process locally stored HTML files.

from newspaper import Article

# Downloading the HTML for the article
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.parse()
with open('fox13no.html', 'w') as fileout:
    fileout.write(article.html)

# Read the locally stored HTML with Newspaper
with open("fox13no.html", 'r') as f:
    # note the URL string is empty
    article = Article('', language='en')
    article.download(input_html=f.read())
    article.parse()
    print(article.title)
    # New Year, new laws: Obamacare, pot, guns and drones
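
When saving and re-reading pages in text mode, it may help to fix the encoding explicitly, since the platform default is not always UTF-8. The following is a minimal sketch under that assumption: save_html, load_html, and parse_saved_article are hypothetical helper names, and UTF-8 is an assumed choice rather than anything newspaper requires.

```python
from pathlib import Path


def save_html(html, path, encoding="utf-8"):
    """Write the HTML string to disk with an explicit encoding."""
    Path(path).write_text(html, encoding=encoding)


def load_html(path, encoding="utf-8"):
    """Read the HTML string back using the same encoding."""
    return Path(path).read_text(encoding=encoding)


def parse_saved_article(path, language="en"):
    """Feed a saved page to newspaper via download(input_html=...)."""
    # Imported here so the two helpers above work without newspaper installed
    from newspaper import Article

    article = Article("", language=language)  # empty URL, as in the answer above
    article.download(input_html=load_html(path))
    article.parse()
    return article


# Possible usage, assuming 'fox13no.html' was saved with save_html():
# article = parse_saved_article("fox13no.html")
# print(article.title)
```

Writing and reading with the same explicit encoding avoids decode errors on pages that contain non-ASCII characters.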
Life is complex
  • I want to extract the text and other information from several email bodies using "article". In that case, getting the link of email is not possible. So how can I read the email as html file? – EMT Jul 28 '21 at 12:02
  • 1
    @EMT If I understand your question correctly, you want to use `newspaper` to extract content from an email that is in HTML? Emails and web articles have different structures. My gut feeling is that `newspaper` isn't a good choice for data extraction. Here are some of my [answers on extracting content from emails.](https://stackoverflow.com/search?q=user%3A6083423+emails) – Life is complex Jul 28 '21 at 13:07
  • Actually "newspaper" only takes url or html format as an input and it is very good library to extract text and also the keywords. That is why I want to read the whole email as html file. – EMT Jul 28 '21 at 13:14
  • 1
    @EMT Yes, `newspaper` can only extract content from either a URL or HTML that is formatted in a specific way. Please post a question with more details, including a link to an HTML file. Once the question is posted, I can see if I can help you with your use case. – Life is complex Jul 28 '21 at 13:22
  • The problem is, I am trying to read the email body and apply "newspaper" on it. Emails contain some paragraphs or text on it. There are many emails and I have read it one by one. Posting the link of those email will not be helpful as it will require login credentials. I am actually stuck there. – EMT Jul 28 '21 at 13:28
  • 1
    @EMT Can you post a sanitized version of an email in a question? I'm very familiar with capabilities of `newspaper` and I don't think that this package is the right choice for your use case. I need to see more details before I can say either yes it will work or no it won't work. – Life is complex Jul 28 '21 at 13:33
  • I have just solved the issue by saving the file as html and read it back using newspaper. I can share with my solution if you want to see. Obviously it might not be the best solution, but right now for topic selection/keyword extraction, I do not know a better one. – EMT Jul 28 '21 at 14:15
  • 1
    @EMT I don't need to see the solution as long as it works for you. – Life is complex Jul 28 '21 at 14:22