
There is a Python library, Newspaper3k, which makes it easier to get the content of web pages: [newspaper][1]

For title retrieval:

from newspaper import Article

a = Article(url)
a.download()
a.parse()
print(a.title)

For content retrieval:

from newspaper import Article

url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
article = Article(url)
article.download()
article.parse()
print(article.text)

I want to get info about web pages (sometimes the title, sometimes the actual content). Here is my code to fetch the content/text of web pages:

from newspaper import Article
import nltk
nltk.download('punkt')
fil = open("laborURLsml2.csv", "r")
# read every line in the file
Lines = fil.readlines()
for line in Lines:
    print(line)
    article = Article(line)
    article.download()
    article.html
    article.parse()
    print("[[[[[")
    print(article.text)
    print("]]]]]")
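Before passing each line to Article, it can help to print the repr() of the line, which makes hidden characters such as a trailing newline visible. A minimal self-contained diagnostic (io.StringIO stands in for the real CSV file here, and the URLs are placeholders):

```python
import io

# io.StringIO stands in for open("laborURLsml2.csv", "r"); the URLs are placeholders
fake_file = io.StringIO("https://example.com/one\nhttps://example.com/two\n")
for line in fake_file.readlines():
    print(repr(line))  # repr() exposes trailing characters kept by readlines()
```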

The content of "laborURLsml2.csv" file is: [laborURLsml2.csv][2]

My issue is: my code reads the first URL and prints its content, but fails from the second URL onwards.

  • Do you see any exception being thrown when processing first url? – bigbounty Jan 11 '21 at 00:17
  • yes, this exception was thrown : " raise ArticleException('Article `download()` failed with %s on URL %s' % ArticleException: Article `download()` failed with 404 Client Error: Not Found for url: https://www.socialeurope.eu/gig-workers-rights-and-their-strategic-litigation%0A on URL https://www.socialeurope.eu/gig-workers-rights-and-their-strategic-litigation" – tursunWali Jan 11 '21 at 00:47
  • Can you put the complete exception message? Or wrap the processing part of the `for` loop in a `try/except` block – bigbounty Jan 11 '21 at 00:48
  • Yes, I wrapped the `for` loop in a try/except block and put all URLs of "laborURLsml2.csv" in a list. It works. I think the newspaper3k library is sensitive to special characters such as "/" at the end of the URL – tursunWali Jan 11 '21 at 19:11
  • @tursunWali newspaper3k isn't sensitive to the special character "/" at the end of the URL, but it is sensitive to trailing whitespace like in your CSV, which I removed with .strip() in my answer. It's also good practice to use a USER_AGENT and timeout when using Newspaper. I noted that you need to do some data cleaning when extracting the article's text. – Life is complex Jan 31 '21 at 21:57

1 Answer


I noted that some of the URLs in your CSV file have a trailing whitespace, which was causing an issue. I also noted that one of your links isn't available and others are the same story distributed to subsidiaries for publication.

The code below handles the first two issues, but it doesn't handle the data redundancy issue.

from newspaper import Config
from newspaper import Article
from newspaper import ArticleException

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

with open('laborURLsml2.csv', 'r') as file:
    csv_file = file.readlines()
    for url in csv_file:
        try:
            article = Article(url.strip(), config=config)
            article.download()
            article.parse()
            print(article.title)
            # the replace is used to remove newlines
            article_text = article.text.replace('\n', ' ')
            print(article_text)
        except ArticleException:
            print('***FAILED TO DOWNLOAD***', article.url)

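The .strip() call is the important part: readlines() keeps the trailing newline on every line, and when such a line is passed to Article the newline gets percent-encoded as %0A, producing the 404 seen in the comments. A small sketch of the difference, using the URL from the exception message above:

```python
line = "https://www.socialeurope.eu/gig-workers-rights-and-their-strategic-litigation\n"
print(repr(line))          # still carries the trailing '\n' from readlines()
print(repr(line.strip()))  # strip() removes it, leaving a clean URL
```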
You might find the newspaper3k overview document that I created and shared on my GitHub page useful.
