
There is a Python library, Newspaper3k, which makes it easier to get the content of web pages: [newspaper][1]

For title retrieval:

from newspaper import Article

a = Article(url)
a.download()
a.parse()
print(a.title)

For content retrieval:

from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.parse()
print(article.text)

I want to get info about web pages (sometimes the title, sometimes the actual content). Here is my code to fetch the content/text of web pages:

from newspaper import Article
import nltk
nltk.download('punkt')
fil = open("laborURLsml2.csv", "r")
# read every line in the file
Lines = fil.readlines()
for line in Lines:
    print(line)
    article = Article(line)
    article.download()
    article.html
    article.parse()
    print("[[[[[")
    print(article.text)
    print("]]]]]")
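Before passing each line to Article, it can help to print the repr() of the line, which makes hidden characters such as a trailing newline visible. A minimal self-contained diagnostic (io.StringIO stands in for the real CSV file here, and the URLs are placeholders):

```python
import io

# io.StringIO stands in for open("laborURLsml2.csv", "r"); the URLs are placeholders
fake_file = io.StringIO("https://example.com/one\nhttps://example.com/two\n")
for line in fake_file.readlines():
    print(repr(line))  # repr() exposes trailing characters kept by readlines()
```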

The content of "laborURLsml2.csv" file is: [laborURLsml2.csv][2]

My issue is: my code reads the first URL and prints its content, but fails from the second URL onwards.

  • Do you see any exception being thrown when processing first url? – bigbounty Jan 11 '21 at 00:17
  • yes, this exception was thrown : " raise ArticleException('Article `download()` failed with %s on URL %s' % ArticleException: Article `download()` failed with 404 Client Error: Not Found for url: https://www.socialeurope.eu/gig-workers-rights-and-their-strategic-litigation%0A on URL https://www.socialeurope.eu/gig-workers-rights-and-their-strategic-litigation" – tursunWali Jan 11 '21 at 00:47
  • Can you put the complete exception message? Or wrap the processing part of the `for` loop in a `try/except` block – bigbounty Jan 11 '21 at 00:48
  • Yes, I wrapped the `for` loop in a try/except block and put all URLs of "laborURLsml2.csv" in a list. It works. I think the newspaper3k library is sensitive to special characters such as "/" at the end of the URL – tursunWali Jan 11 '21 at 19:11
  • @tursunWali newspaper3k isn't sensitive to the special character "/" at the end of the URL, but it is sensitive to trailing whitespace like in your CSV, which I removed with .strip() in my answer. It's also good practice to use a USER_AGENT and timeout when using Newspaper. I noted that you need to do some data cleaning when extracting the article's text. – Life is complex Jan 31 '21 at 21:57

1 Answer


I noted that some of the URLs in your CSV file have a trailing whitespace, which was causing an issue. I also noted that one of your links isn't available and others are the same story distributed to subsidiaries for publication.

The code below handles the first two issues, but it doesn't handle the data redundancy issue.

from newspaper import Config
from newspaper import Article
from newspaper import ArticleException

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

with open('laborURLsml2.csv', 'r') as file:
    csv_file = file.readlines()
    for url in csv_file:
        try:
            article = Article(url.strip(), config=config)
            article.download()
            article.parse()
            print(article.title)
            # the replace is used to remove newlines
            article_text = article.text.replace('\n', ' ')
            print(article_text)
        except ArticleException:
            print('***FAILED TO DOWNLOAD***', article.url)

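The .strip() call is the important part: readlines() keeps the trailing newline on every line, and when such a line is passed to Article the newline gets percent-encoded as %0A, producing the 404 seen in the comments. A small sketch of the difference, using the URL from the exception message above:

```python
line = "https://www.socialeurope.eu/gig-workers-rights-and-their-strategic-litigation\n"
print(repr(line))          # still carries the trailing '\n' from readlines()
print(repr(line.strip()))  # strip() removes it, leaving a clean URL
```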
You might find the newspaper3k overview document that I created and shared on my GitHub page useful.
