
I'm using the Newspaper module for Python found here.

In the tutorials, it describes how you can pool the building of different newspapers so that they are generated at the same time (see the "Multi-threading article downloads" section in the link above).

Is there any way to do this for pulling articles straight from a LIST of urls? That is, is there any way I can feed multiple urls into the following set-up and have them downloaded and parsed concurrently?

from newspaper import Article
url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
a = Article(url, language='zh') # Chinese
a.download()
a.parse()
print(a.text[:150])
Vadim Kotov
Afflatus
  • What do you mean by "pulling articles straight from urls"? Are you trying to scrape and download all linked articles from a given URL? – Josep Valls May 25 '16 at 03:55
  • I just want to scrape the url provided for the article on the page. I want to be able to provide a set of urls so they can be downloaded in tandem. – Afflatus May 25 '16 at 03:56

4 Answers


I was able to do this by creating a Source for each article URL. (Disclaimer: I'm not a Python developer.)

import newspaper

urls = [
  'http://www.baltimorenews.net/index.php/sid/234363921',
  'http://www.baltimorenews.net/index.php/sid/234323971',
  'http://www.atlantanews.net/index.php/sid/234323891',
  'http://www.wpbf.com/news/funeral-held-for-gabby-desouza/33874572',  
]

class SingleSource(newspaper.Source):
    def __init__(self, articleURL):
        super(StubSource, self).__init__("http://localhost")
        self.articles = [newspaper.Article(url=url)]

sources = [SingleSource(articleURL=u) for u in urls]

newspaper.news_pool.set(sources)
newspaper.news_pool.join()

for s in sources:
  print(s.articles[0].html)
Kyle Truscott

I know this question is really old, but it's one of the first links that shows up when I googled how to multithread newspaper downloads. While Kyle's answer is very helpful, it is not complete and I think it has some typos...

import newspaper

urls = [
'http://www.baltimorenews.net/index.php/sid/234363921',
'http://www.baltimorenews.net/index.php/sid/234323971',
'http://www.atlantanews.net/index.php/sid/234323891',
'http://www.wpbf.com/news/funeral-held-for-gabby-desouza/33874572',  
]

class SingleSource(newspaper.Source):
    def __init__(self, articleURL):
        super(SingleSource, self).__init__("http://localhost")
        self.articles = [newspaper.Article(url=articleURL)]

sources = [SingleSource(articleURL=u) for u in urls]

newspaper.news_pool.set(sources)
newspaper.news_pool.join()

I changed StubSource to SingleSource and one of the urls to articleURL. Of course this just downloads the web pages; you still need to parse them to be able to get the text.

multi = []
for s in sources:
    try:
        s.articles[0].parse()
        multi.append(s.articles[0].text)
    except Exception:
        # skip articles that failed to download or parse
        pass

In my sample of 100 urls, this took half the time compared to working through each url in sequence. (Edit: after increasing the sample size to 2000, the reduction is about a quarter.)
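
For reference, the sequential baseline can be timed with something like this (a hypothetical sketch, not the exact benchmark I ran; it only relies on time.perf_counter and the urls list from above):

import time
from newspaper import Article

# time the plain one-url-at-a-time approach for comparison
start = time.perf_counter()
for u in urls:
    a = Article(u)
    a.download()
    a.parse()
print('sequential:', time.perf_counter() - start, 'seconds')

Timing the pooled version the same way gives the comparison described above.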

(Edit: Got the whole thing working with multithreading!) I used this very good explanation for my implementation. With a sample size of 100 urls, using 4 threads takes a comparable time to the code above, but increasing the thread count to 10 gives a further reduction of about half. A larger sample size needs more threads to give a comparable difference.

from newspaper import Article
from multiprocessing.dummy import Pool as ThreadPool

def getTxt(url):
    article = Article(url)
    article.download()
    try:
        article.parse()
        return article.text
    except Exception:
        # return an empty string for articles that fail to parse
        return ""

pool = ThreadPool(10)

# open the urls in their own threads
# and return the results
results = pool.map(getTxt, urls)

# close the pool and wait for the work to finish 
pool.close() 
pool.join()
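
Since pool.map returns the results in the same order as the input list, the texts can be matched back up with their URLs afterwards, for example (a small usage sketch):

# pair each url with the text that was extracted for it
texts_by_url = dict(zip(urls, results))
for u, txt in texts_by_url.items():
    print(u, len(txt))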

To build upon Josep Valls' answer: I'm assuming the original poster wanted to use multithreading to extract a bunch of data and store it somewhere properly. After much trying I think I have found a solution; it may not be the most efficient, but it works. I've tried to make it better, though I think the newspaper3k package could be a bit buggy. In any case, the following extracts the desired elements into a DataFrame.

import newspaper
from newspaper import news_pool
import pandas as pd

gamespot_paper = newspaper.build('https://www.gamespot.com/news/', memoize_articles=False)
bbc_paper = newspaper.build("https://www.bbc.com/news", memoize_articles=False)
papers = [gamespot_paper, bbc_paper]
news_pool.set(papers, threads_per_source=4)
news_pool.join()

#Create our final dataframe
df_articles = pd.DataFrame()

#Create a download limit per source
limit = 100

for source in papers:
    #temporary lists to store each element we want to extract
    list_title = []
    list_text = []
    list_source = []

    count = 0

    for article_extract in source.articles:
        if count > limit:
            break

        article_extract.parse()

        #appending the elements we want to extract
        list_title.append(article_extract.title)
        list_text.append(article_extract.text)
        list_source.append(article_extract.source_url)

        #Update count
        count += 1

    df_temp = pd.DataFrame({'Title': list_title, 'Text': list_text, 'Source': list_source})
    #Append to the final DataFrame
    df_articles = pd.concat([df_articles, df_temp], ignore_index=True)
    print('source extracted')
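
The resulting DataFrame can then be inspected or written out as usual, for example (a small usage sketch; the filename is just an illustration):

print(df_articles.head())
df_articles.to_csv('articles.csv', index=False)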

Please do suggest any improvements!

blue_berry

I'm not familiar with the Newspaper module, but the following code uses a list of URLs and should be equivalent to the one provided in the linked page:

import newspaper
from newspaper import news_pool

urls = ['http://slate.com','http://techcrunch.com','http://espn.com']
papers = [newspaper.build(i) for i in urls]
news_pool.set(papers, threads_per_source=2)
news_pool.join()
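
Once the pool has finished, each source's articles hold the downloaded HTML; parsing them gives access to the text, for example (a minimal sketch following the standard newspaper workflow):

for paper in papers:
    for article in paper.articles[:5]:  # first few articles per source
        article.parse()                 # parse the HTML that news_pool downloaded
        print(article.title)
        print(article.text[:150])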
Josep Valls
  • This is what I meant by "In the tutorials, it describes how you can pool the building of different newspapers so that they are generated at the same time." I don't think this does what I want, however. Specifically, I tried doing the same thing but with the urls of specific articles, and I don't see how I can extract the texts of the articles... if they were even downloaded. – Afflatus May 25 '16 at 15:07