I'm a newbie to Python and programming in general so please excuse me if the question is very dumb.

I've been following this tutorial on RSS scraping step by step, but Python raises a "list index out of range" error when I try to gather the links that correspond to the article titles.

Here is my code:

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

source  = urlopen('http://feeds.huffingtonpost.com/huffingtonpost/raw_feed').read()

title = re.compile('<title>(.*)</title>')
link = re.compile('<link>(.*)</link>')

find_title = re.findall(title, source)
find_link = re.findall(link, source)

literate = []
literate[:] = range(1, 16)

for i in literate:
    print find_title[i]
    print find_link[i]

It executes fine when I only tell it to retrieve titles, but it immediately throws an index error when I try to retrieve both the titles and their corresponding links.

Any assistance will be greatly appreciated.

user1205632

2 Answers

You could use the feedparser module to parse an RSS feed from a given URL:

#!/usr/bin/env python
import feedparser # pip install feedparser

d = feedparser.parse('http://feeds.huffingtonpost.com/huffingtonpost/latestnews')
# .. skipped handling of HTTP errors, caching ..

for e in d.entries:
    print(e.title)
    print(e.link)
    print(e.description)
    print("\n") # 2 newlines

Output

Even Critics Of Safety Net Increasingly Depend On It
http://www.huffingtonpost.com/2012/02/12/safety-net-benefits_n_1271867.html
<p>Ki Gulbranson owns a logo apparel shop, deals in 
<!-- ... snip ... -->

Christopher Cain, Atlanta Anti-Gay Attack Suspect, Arrested And
Charged With Aggravated Assault And Robbery
http://www.huffingtonpost.com/2012/02/12/atlanta-anti-gay-suspect-christopher-cain-arrested_n_1271811.html
<p>ATLANTA -- Atlanta police have arrested a suspect 
<!-- ... snip ... -->

It is usually not a good idea to use regular expressions to parse RSS (XML).
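
If you would rather not install anything, here is a minimal sketch of the same idea using the standard library's xml.etree.ElementTree instead of a regex. It assumes an Atom-style feed in which each entry's link is a `<link rel="alternate" href="...">` element; `SAMPLE_FEED`, `extract_entries`, and the example.com URLs are made up for illustration and are not taken from the real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in an Atom-style layout; this is not real feed data.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry>
    <title>First article</title>
    <link rel="alternate" type="text/html" href="http://example.com/first.html" />
  </entry>
  <entry>
    <title>Second article</title>
    <link rel="alternate" type="text/html" href="http://example.com/second.html" />
  </entry>
</feed>"""

# Atom elements live in this namespace, so lookups need a prefix mapping.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def extract_entries(xml_text):
    """Return a list of (title, href) tuples, one per <entry> element."""
    root = ET.fromstring(xml_text)
    pairs = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        link = entry.find("atom:link[@rel='alternate']", ATOM)
        # An entry without an alternate link yields None instead of an error.
        href = link.get("href") if link is not None else None
        pairs.append((title, href))
    return pairs


for title, href in extract_entries(SAMPLE_FEED):
    print(title)
    print(href)
```

Because each title and link is read from the same entry element, the two can never get out of step, so no manual indexing is needed.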

jfs

I think you are using the wrong regex to extract the links from your page.

>>> link = re.compile('<link rel="alternate" type="text/html" href=(.*)')
>>> find_link = re.findall(link, source)
>>> find_link[1].strip()
'"http://www.huffingtonpost.com/andrew-brandt/the-peyton-predicament-pa_b_1271834.html" />'
>>> len(find_link)
15
>>>

Take a look at the HTML source of your page: you will find that the links are not enclosed in a <link></link> pair.

The actual pattern is <link rel="alternate" type="text/html" href=, followed by the link.

That's the reason why your regex is not working.
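
To make the difference concrete, here is a small sketch run against a made-up snippet rather than the live feed. `SAMPLE` and the example.com URLs are hypothetical, and the capture group around the quoted href value is my own tweak so that only the URL is returned. The original `<link>(.*)</link>` pattern matches nothing, so indexing into its result raises the IndexError from the question. Pairing the two lists with zip() also avoids hard-coded indices.

```python
import re

# Hypothetical snippet in the layout described above; not real feed data.
SAMPLE = '''
<entry>
<title>First article</title>
<link rel="alternate" type="text/html" href="http://example.com/first.html" />
</entry>
<entry>
<title>Second article</title>
<link rel="alternate" type="text/html" href="http://example.com/second.html" />
</entry>
'''

# The pattern from the question finds no links here, so any index fails.
old_links = re.findall(r'<link>(.*)</link>', SAMPLE)

titles = re.findall(r'<title>(.*)</title>', SAMPLE)
# Capture only the URL between the quotes after href=.
links = re.findall(
    r'<link rel="alternate" type="text/html" href="([^"]*)"', SAMPLE
)

# zip() stops at the shorter list, so it cannot run past the end.
pairs = list(zip(titles, links))
for title, link in pairs:
    print(title)
    print(link)
```

Note that Python lists are indexed from 0, so with 15 matches a loop over range(1, 16) would skip the first item and then fail on index 15 even with a working pattern.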

RanRag