I have been experimenting with Python 2.7.3 to extract data from an RSS feed and print it in the Python shell.
I found a video on YouTube on the subject and copied the code from the tutorial.
The code works and extracts the data I need, but the output is laid out wrong.
The code is as follows:
import re
import time
import urllib2  # import the module itself: the code below calls urllib2.build_opener
from cookielib import CookieJar

# Opener that keeps cookies and sends a browser-like User-agent header
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

def main():
    try:
        page = 'http://thewebsite/feed/rs/rss.xml'
        sourceCode = opener.open(page).read()
        try:
            # Each findall returns one flat list for the whole feed
            titles = re.findall(r'<title>(.*?)</title>', sourceCode)
            desc = re.findall(r'<description>(.*?)</description>', sourceCode)
            links = re.findall(r'<link>(.*?)</link>', sourceCode)
            pub = re.findall(r'<pubDate>(.*?)</pubDate>', sourceCode)
            # Four separate loops: every title prints before any description
            for title in titles:
                print title
            for description in desc:
                print description
            for link in links:
                print link
            for pubDate in pub:
                print pubDate
        except Exception, e:
            print str(e)
    except Exception, e:
        print str(e)

main()
The problem is that when I run the code I see a long list of titles, then a list of descriptions, then a list of links, then a list of dates,
like this:
The Title
The Title
The Title
the description
the description
the description
the link
the link
the link
the date
the date
the date
What I really want is the output grouped per item, like this:
The Title
the description
the link
the date
The Title
the description
the link
the date
The Title
the description
the link
the date
Can anyone help with correcting the output?
I really want to just grab headlines from the RSS feed and output them to the shell.
The purpose is to use this on a Raspberry Pi instead of using the web browser.
Any help much appreciated.
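A minimal sketch of the grouping I am after, assuming the four lists come back the same length and in the same order. The `sample` string and the `grouped_lines` helper are made up for illustration and are not part of the tutorial code; `print(...)` is written with a single argument so it behaves the same on Python 2.7 and 3.

```python
import re

# Made-up sample standing in for the downloaded feed text (hypothetical data).
sample = (
    "<item><title>First</title><description>d1</description>"
    "<link>http://example.com/1</link>"
    "<pubDate>Fri, 17 Jan 2014 21:38:05 GMT</pubDate></item>"
    "<item><title>Second</title><description>d2</description>"
    "<link>http://example.com/2</link>"
    "<pubDate>Fri, 17 Jan 2014 22:11:54 GMT</pubDate></item>"
)

def grouped_lines(source):
    """Return title, description, link, date for item 1, then item 2, ..."""
    titles = re.findall(r'<title>(.*?)</title>', source)
    desc = re.findall(r'<description>(.*?)</description>', source)
    links = re.findall(r'<link>(.*?)</link>', source)
    pub = re.findall(r'<pubDate>(.*?)</pubDate>', source)
    lines = []
    # zip() walks the four lists in step, one entry from each per iteration
    for title, description, link, pub_date in zip(titles, desc, links, pub):
        lines.extend([title, description, link, pub_date])
    return lines

for line in grouped_lines(sample):
    print(line)
```

Note that `zip()` stops at the shortest list, so if one tag is missing from some items the remaining entries are silently dropped or paired with the wrong item.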
**UPDATE**
Also, the RSS feed I am parsing has extra tags at the beginning, including elements called title and link. Is there any way these can be excluded?
The feed starts with this:
**** DON'T WANT THIS. HAS TAGS I DON'T WANT TO USE
<RSS>
<channel>
<title>DBE News</title>
<link>http://www.BDE.co.uk/news//#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa</link>
<description>The latest stories from the BDE News web site. </description>
<language>en-gb</language>
<lastBuildDate>Fri, 17 Jan 2014 22:11:54 GMT</lastBuildDate>
<copyright>Copyright: DBE, see http://DBE.com for terms and conditions of reuse. </copyright>
<image>
<url>http://news.DBE.co.uk/nol/shared/img/DBE.gif</url>
<title>DBE News</title>
<link>http://www.DBE.co.uk/news/uk/#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa</link>
<width>120</width>
<height>60</height>
</image>
<ttl>15</ttl>
<atom:link href="http://feeds.DBE.co.uk/news/uk/rss.xml" rel="self" type="application/rss+xml"/>
**** DON'T WANT THIS. HAS TAGS I DON'T WANT TO USE
****NORMAL FEED STARTS HERE. WANT ALL THE INFO HERE
<item>
<title>Stackover flow is cool</title>
<description>Get the answers here.</description>
<link>http://www.DBE.co.uk/news/Stsck#sa-ns_mchannel=rss&ns_source=PublicRSS20-sa</link>
<guid isPermaLink="false">http://www.DBE.co.uk/news/uStack</guid>
<pubDate>Fri, 17 Jan 2014 21:38:05 GMT</pubDate>
<media:thumbnail width="66" height="49" url="http://news.DBEimg.co.uk/media/images/72364000/jpg/_72364484_mikaeel.jpg"/>
<media:thumbnail width="144" height="81" url="http://news.DBEimg.co.uk/media/images/72364000/jpg/_72364485_mikaeel.jpg"/>
</item>
<item>
The extra tags cause a 'list index out of range' error.
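One approach that might avoid this, sketched here under assumptions rather than tested against the real feed: pull out each `<item>...</item>` block first and search for the four tags only inside that block, so the channel-level title and link are never seen. `sample_feed`, `first_match` and `parse_items` are hypothetical names with made-up data; a tag missing from an item becomes an empty string instead of raising an error.

```python
import re

# Made-up feed text with channel-level tags before the items (hypothetical data).
sample_feed = """<rss><channel>
<title>Channel title</title>
<link>http://example.com/</link>
<description>Channel description</description>
<item>
<title>First headline</title>
<description>First summary</description>
<link>http://example.com/1</link>
<pubDate>Fri, 17 Jan 2014 21:38:05 GMT</pubDate>
</item>
<item>
<title>Second headline</title>
<description>Second summary</description>
<link>http://example.com/2</link>
</item>
</channel></rss>"""

FIELDS = ('title', 'description', 'link', 'pubDate')

def first_match(tag, text):
    """Return the text of the first <tag>...</tag> in text, or '' if absent."""
    match = re.search(r'<%s>(.*?)</%s>' % (tag, tag), text, re.DOTALL)
    return match.group(1).strip() if match else ''

def parse_items(source):
    """Return one (title, description, link, pubDate) tuple per <item> block."""
    items = []
    # re.DOTALL lets '.' span the newlines inside each <item> block
    for block in re.findall(r'<item>(.*?)</item>', source, re.DOTALL):
        items.append(tuple(first_match(tag, block) for tag in FIELDS))
    return items

for title, description, link, pub_date in parse_items(sample_feed):
    print(title)
    print(description)
    print(link)
    print(pub_date)
    print('')  # blank line between items
```

Because every field is looked up inside its own item, the output stays grouped per item and nothing is indexed across lists of different lengths.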