4

I'm learning Python. As practice I'm building an RSS scraper with feedparser, putting the output into a pandas DataFrame, and trying to mine it with NLTK. First, though, I need to get a list of articles from multiple RSS feeds.

I used this post on how to pass multiple feeds and combined it with an answer I got previously to another question on how to get it into a Pandas dataframe.

The problem is that I want to see the data from all the feeds in my dataframe, but currently I can only access the first item in the list of feeds.

feedparser seems to be doing its job, but when I put the results into the pandas DataFrame it only seems to grab the first RSS feed in the list.

import feedparser
import pandas as pd

rawrss = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
    ]

feeds = []
for url in rawrss:
    feeds.append(feedparser.parse(url))

for feed in feeds:
    for post in feed.entries:
        print(post.title, post.link, post.summary)

df = pd.DataFrame(columns=['title', 'link', 'summary'])

for i, post in enumerate(feed.entries):
    df.loc[i] =  post.title, post.link, post.summary

df.shape

df
Nick Duddy

2 Answers

13

Your code loops through each post and prints its data. The part of your code that adds the post data to the dataframe is not inside that loop (in Python, indentation is meaningful!), so you only see the data from one feed in your dataframe.
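A minimal, network-free sketch of that effect, using hypothetical stand-in lists in place of real feeds (the names `outside` and `inside` are made up for illustration): once a `for` loop ends, the loop variable still refers to the last item, so a dedented block only ever sees that one item.

```python
# Stand-in data (hypothetical): each inner list plays the role of feed.entries
feeds = [["bbc-1", "bbc-2"], ["yahoo-1"], ["tc-1", "tc-2", "tc-3"]]

for feed in feeds:
    pass  # the loop body runs once per feed

# Dedented code runs once, after the loop; `feed` is still bound to the last item
outside = list(feed)

inside = []
for feed in feeds:
    # Indented code runs for every feed
    inside.extend(feed)

print(outside)
print(inside)
```

Because the variable keeps the last item, the dataframe in the question would most likely hold the entries of the final URL in `rawrss` rather than the first.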

You can build a list of posts as you loop through the feeds, and then create a dataframe at the end:

import feedparser
import pandas as pd

rawrss = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
    ]

feeds = [] # list of feed objects
for url in rawrss:
    feeds.append(feedparser.parse(url))

posts = [] # list of posts [(title1, link1, summary1), (title2, link2, summary2) ... ]
for feed in feeds:
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary))

df = pd.DataFrame(posts, columns=['title', 'link', 'summary']) # pass data to init

You could optimize this a little bit by combining the two for loops:

posts = []
for url in rawrss:
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary))
beenjaminnn
  • Thanks, that works perfectly. It also properly showed me how indentation works and its effects. Thanks as well for the optimised version; I see and understand what you did there and I'm using that instead. – Nick Duddy Aug 17 '17 at 12:18
  • @beenjaminnn What would be the best way, in the second part of your code, to check if the URL already exists in your list and only append if the URL is unique? – Pythoner Oct 14 '19 at 14:14
  • I'm assuming you mean only append the post if the link does not exist in the `posts` list? I would keep the lists of titles, links and summaries separate and then check `if post.link in links`. – beenjaminnn Oct 14 '19 at 17:54
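The idea in the last comment can be sketched roughly as follows. This is a variation, not the commenter's exact approach: it tracks links in a `set` (here called `seen_links`, a made-up name) instead of a separate list, and it uses plain dicts as hypothetical stand-ins for feedparser entries so that it runs without network access.

```python
# Stand-in entries (hypothetical); with feedparser these would come from feed.entries
entries = [
    {"title": "A", "link": "http://example.com/a", "summary": "first"},
    {"title": "B", "link": "http://example.com/b", "summary": "second"},
    {"title": "A again", "link": "http://example.com/a", "summary": "duplicate link"},
]

posts = []
seen_links = set()  # set membership checks are O(1) on average

for post in entries:
    link = post["link"]
    if link in seen_links:
        continue  # skip posts whose link was already collected
    seen_links.add(link)
    posts.append((post["title"], link, post["summary"]))

print(posts)
```

With real feeds, the same check would sit inside the inner `for post in feed.entries:` loop, comparing `post.link` against the set before appending.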
0

I use a dict to build the DataFrame:

import feedparser
import pandas as pd

rawrss = [
    'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml',
    'https://www.yahoo.com/news/rss/',
    'http://www.huffingtonpost.co.uk/feeds/index.xml',
    'http://feeds.feedburner.com/TechCrunch/',
    ]

rows = []  # one dict per entry; the DataFrame is built once at the end

for url in rawrss:
    dp = feedparser.parse(url)

    for i, e in enumerate(dp.entries):
        one_feed = {}
        # Fall back to a placeholder when a feed omits a field
        one_feed['etitle'] = e.title if 'title' in e else f'title {i}'
        one_feed['summary'] = e.summary if 'summary' in e else f'no summary {i}'
        one_feed['elink'] = e.link if 'link' in e else f'link {i}'
        one_feed['published'] = e.published if 'published' in e else f'no published {i}'
        one_feed['elink_img'] = e.links[1].href if 'links' in e and len(e.links) > 1 else f'no link_img {i}'

        rows.append(one_feed)

# DataFrame.append was removed in pandas 2.0, so convert the collected rows in one step
df = pd.DataFrame(rows)

It's easier to add columns this way.
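One way to see why, sketched with a hypothetical helper `entry_to_row` and plain dicts standing in for feedparser entries (both are assumptions for illustration, not part of the code above): each column is one key in the row dict, so a new column is one extra line.

```python
def entry_to_row(entry, i):
    """Build one row dict from a dict-like entry, with placeholder defaults."""
    return {
        "etitle": entry.get("title", f"title {i}"),
        "summary": entry.get("summary", f"no summary {i}"),
        "elink": entry.get("link", f"link {i}"),
        # Adding a column means adding one more key here
        "published": entry.get("published", f"no published {i}"),
    }


# Stand-in entries (hypothetical); real ones would come from dp.entries
entries = [
    {"title": "A", "link": "http://example.com/a"},
    {"summary": "only a summary"},
]

rows = [entry_to_row(e, i) for i, e in enumerate(entries)]
print(rows)
```

Passing a list of dicts like `rows` to `pd.DataFrame` produces one column per key, so the placeholder defaults keep every row the same shape.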

S.B