4

I can't download articles the usual way, by instantiating an Article object and calling download(), as below:

from newspaper import Article
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.parse()  # top_image is only populated after parse()
print(article.top_image)

However, I can get the HTML with a plain request. Can I pass this raw HTML to Newspaper somehow and extract the image from it? (Below is my attempt, which doesn't work.) Thanks.

from newspaper import Article
import requests
url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
raw_html= requests.get(url, verify=False, proxies=proxy)
article = Article('')
article.set_html(raw_html)
article.top_image
notverygood
  • Why does it not work? Which error are you getting? – Ciprian Tomoiagă Sep 14 '20 at 10:46
  • I can't inject my company's internal SSL certificate key into my request. The issue is being looked into. The only workaround is to make the request manually and pass `verify=False`, which gives me the raw HTML – notverygood Sep 14 '20 at 10:52

3 Answers

4

The Python module Newspaper allows proxies to be used, but this feature is not listed within the module's documentation.


Proxies with Newspaper

from newspaper import Article
from newspaper.configuration import Configuration

# add your corporate proxy information and test the connection
PROXIES = {
           'http': "http://ip_address:port_number",
           'https': "https://ip_address:port_number"
          }

config = Configuration()
config.proxies = PROXIES

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
articles = Article(url, config=config)
articles.download()
articles.parse()
print(articles.top_image)
https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg
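If hard-coding the proxy address is undesirable, the PROXIES dictionary above could be built from environment variables instead. The following is only a sketch: `proxies_from_env` is a hypothetical helper (not part of Newspaper), and it assumes the conventional `HTTP_PROXY` / `HTTPS_PROXY` variable names are set on your machine.

```python
import os


def proxies_from_env(environ=None):
    """Build a {'http': ..., 'https': ...} dict from proxy environment variables.

    Hypothetical helper: looks up HTTP_PROXY / HTTPS_PROXY (upper case first,
    then lower case) and skips any scheme that is not configured.
    """
    env = os.environ if environ is None else environ
    proxies = {}
    for scheme in ("http", "https"):
        value = env.get(f"{scheme.upper()}_PROXY") or env.get(f"{scheme}_proxy")
        if value:
            proxies[scheme] = value
    return proxies


# Example with an explicit mapping, so nothing depends on the real environment
example_env = {
    "HTTP_PROXY": "http://10.0.0.1:8080",
    "https_proxy": "http://10.0.0.1:8443",
}
PROXIES = proxies_from_env(example_env)
print(PROXIES)
```

The resulting dictionary has the same shape as the hand-written PROXIES above, so it could be assigned to `config.proxies` in the same way.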

Requests with Proxies and Newspaper

import requests
from newspaper import Article

# placeholder proxy settings; replace with your corporate proxy
proxy = {
         'http': "http://ip_address:port_number",
         'https': "https://ip_address:port_number"
        }

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
raw_html = requests.get(url, verify=False, proxies=proxy)
article = Article('')
article.download(input_html=raw_html.content)  # reuse the HTML already fetched
article.parse()
print(article.top_image)
https://ewscripps.brightspotcdn.com/dims4/default/d49dab0/2147483647/strip/true/crop/400x210+0+8/resize/1200x630!/quality/90/?url=http%3A%2F%2Fmediaassets.fox13now.com%2Ftribune-network%2Ftribkstu-files-wordpress%2F2012%2F04%2Fnational-news-e1486938949489.jpg
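To check whether the raw HTML you fetched contains a lead image at all, independent of Newspaper, something like the following standard-library sketch may help. It is not how Newspaper picks `top_image`; it only looks for an Open Graph `og:image` meta tag, and the names `OpenGraphImageParser`, `find_og_image` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class OpenGraphImageParser(HTMLParser):
    """Remembers the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.image_url = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except <meta> tags, and stop after the first hit
        if tag != "meta" or self.image_url is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:image":
            self.image_url = attributes.get("content")


def find_og_image(raw_html):
    """Return the og:image URL found in raw_html, or None if there is none."""
    parser = OpenGraphImageParser()
    parser.feed(raw_html)
    return parser.image_url


# Illustrative HTML; in practice this would be the text of your response
sample_html = """<html><head>
<meta property="og:image" content="https://example.com/lead.jpg">
</head><body><p>Story text</p></body></html>"""

print(find_og_image(sample_html))
```

If this returns None for your page, the page may simply not declare an Open Graph image, which would be worth ruling out before debugging the Newspaper side.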
Life is complex
0

I think this is what you want:

from newspaper import fulltext

html = 'your html'

text_from_html = fulltext(html)
Andrew Anderson
-1

First, make sure you're using Python 3 and that you have run pip3 install newspaper3k.

Then, if you're getting SSL warnings with the first version (like the one below):

/usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py:981: InsecureRequestWarning: Unverified HTTPS request is being made to host 'fox13now.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  warnings.warn(

you can disable them by adding

import urllib3
urllib3.disable_warnings()
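Note that this silences the warning for the whole process and does not make the request any safer. If you would rather hide it only around the one call that needs `verify=False`, a scoped filter from the standard `warnings` module is one possibility. This is a sketch: `call_quietly` and `fake_insecure_call` are hypothetical names, and the filter matches on the start of the warning message shown above.

```python
import warnings


def call_quietly(func, *args, **kwargs):
    """Call func with 'Unverified HTTPS request' warnings hidden for this call only."""
    with warnings.catch_warnings():
        # The message argument is a regex matched against the start of the warning text
        warnings.filterwarnings("ignore", message="Unverified HTTPS request")
        return func(*args, **kwargs)


def fake_insecure_call():
    """Stand-in for requests.get(url, verify=False); emits a similar warning."""
    warnings.warn("Unverified HTTPS request is being made to host 'example.com'.")
    return "ok"


print(call_quietly(fake_insecure_call))
```

Outside of `call_quietly`, the warning filters are restored, so other unverified requests would still be reported.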

This should work:

from newspaper import Article
import urllib3
urllib3.disable_warnings()


url = "https://www.fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/"
article = Article(url)
article.download()
print(article.html)

Run with python3 <yourfile>.py.


Setting the html in the Article yourself won't do you much good, as you'd not get anything in the other fields that way. Let me know if that fixes the issue, or if any other errors pop up!

char
  • The reason I'm not able to download using Newspaper is that I'm behind a corporate proxy. I tried injecting the SSL certificate in numerous ways. The only way I can get past it for now is using `verify=False` in a request, which will obviously have to change down the line. I can run Newspaper's `summary` on raw HTML, so my intuition is that I should be able to get the image from raw HTML as well. – notverygood Sep 14 '20 at 10:36
  • Ah, that complicates things. Can you use fulltext? `from newspaper import fulltext; html = requests.get(...).text; text = fulltext(html)` – char Sep 14 '20 at 11:49
  • Yes, I can do that. If you run the second snippet of code, you should be able to test which functions from Article work on raw HTML and which don't. – notverygood Sep 14 '20 at 11:56
  • Another option might be to add a custom version of Article (see the last code block on [this blog](https://www.codeleading.com/article/95122429713/)). – char Sep 14 '20 at 11:58