
Dear Stackoverflow community!

I would like to scrape news articles from the CNN RSS feed and get the link for each scraped article. This works very well with the Python newspaper library, but unfortunately I am unable to get the output in a usable format, i.e. a list or a dictionary.

I want to add the scraped links to one SINGLE list, instead of many separate lists.

    import feedparser as fp
    import newspaper
    from newspaper import Article

    website = {"cnn": {"link": "http://edition.cnn.com/", "rss": "http://rss.cnn.com/rss/cnn_topstories.rss"}}

    for source, value in website.items():
        if 'rss' in value:
            d = fp.parse(value['rss']) 
            #if there is an RSS value for a company, it will be extracted into d

            for entry in d.entries:
                if hasattr(entry, 'published'):
                    article = {}
                    article['link'] = entry.link
                    print(article['link'])

The output is as follows:

    http://rss.cnn.com/~r/rss/cnn_topstories/~3/5aHaFHz2VtI/index.html
    http://rss.cnn.com/~r/rss/cnn_topstories/~3/_O8rud1qEXA/joe-walsh-trump-gop-voters-sot-crn-vpx.cnn
    http://rss.cnn.com/~r/rss/cnn_topstories/~3/xj-0PnZ_LwU/index.html
    .......

I would like to have ONE list with all the links in it, i.e.:

    list = ["http://rss.cnn.com/~r/rss/cnn_topstories/~3/5aHaFHz2VtI/index.html", "http://rss.cnn.com/~r/rss/cnn_topstories/~3/_O8rud1qEXA/joe-walsh-trump-gop-voters-sot-crn-vpx.cnn", "http://rss.cnn.com/~r/rss/cnn_topstories/~3/xj-0PnZ_LwU/index.html", ...]

I tried appending the content via a for loop as follows:

    for i in article['link']:
        article_list = []
        article_list.append(i)
        print(article_list)

But then the output is like this:

    ['h']
    ['t']
    ['t']
    ['p']
    [':']
    ['/']
    ['/']
    ['r']
    ['s']
    ...

Does anyone know an alternative method to get the content into one list? Or alternatively into a dictionary like the following:

    dict = {'links': [link1, link2, link3]}

Thank you VERY much in advance for your help!!

Mercury

1 Answer


Try modifying your code like this and see if it works:

    article_list = []  # define the list once, outside the loop
    for entry in d.entries:
        if hasattr(entry, 'published'):
            article = {}
            article['link'] = entry.link
            article_list.append(article['link'])
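
In context, a minimal sketch (assuming the same `website` dictionary and `feedparser` import from your question) could look like this:

    import feedparser as fp

    website = {"cnn": {"link": "http://edition.cnn.com/", "rss": "http://rss.cnn.com/rss/cnn_topstories.rss"}}

    article_list = []  # created once, so every link is appended to the same list
    for source, value in website.items():
        if 'rss' in value:
            d = fp.parse(value['rss'])
            for entry in d.entries:
                if hasattr(entry, 'published'):
                    article_list.append(entry.link)  # append the whole link string

    print(article_list)                    # one single list of links
    links_dict = {'links': article_list}   # or the dictionary form you mentioned

The important change is that `article_list` is defined before the loop and `entry.link` is appended as a whole string; in your attempt, `for i in article['link']` iterated over the characters of a single URL and the list was re-created on every pass.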
Jack Fleeting
  • YES! This actually worked! Thank you so much - I spent hours on trying to figure this out. Have a nice day! – Mercury Oct 22 '19 at 06:29
  • @Mercury - Glad it worked for you! Don't forget to accept the answer. – Jack Fleeting Oct 22 '19 at 10:18
  • Hi there again! Unfortunately, as soon as I change the source to multiple RSS links, it only appends the second source's links to the list. For example, I added the CNBC link: website = {"cnn": {"link": "http://edition.cnn.com/", "rss": "http://rss.cnn.com/rss/cnn_topstories.rss"}, "cnbc":{"link": "https://www.cnbc.com/", "rss": "https://www.cnbc.com/id/10000664/device/rss/rss.html"}} And now the list is only filled with CNBC links. Do you know how I can solve this problem and append all the links from all the sources? Thank you in advance!! P.S. Answer is accepted! :-) – Mercury Oct 22 '19 at 11:42
  • I suspect it has something to do with the definition of `website` in the case of multiple links. The best way to handle it (and probably what SO prefers in these cases) is to post the new issue as a different question with the code you use. This will also add visibility to the new problem and allow others to chime in. – Jack Fleeting Oct 22 '19 at 12:33
  • Ok, I will post a new question regarding this issue. Thanks a lot! – Mercury Oct 23 '19 at 10:09