Python - Scraping website data - formatting printed results

Question

I have been experimenting with Python 2.7.3 to extract data from an RSS feed so that I can output it to the Python shell.

I found a video on YouTube on the subject and copied the code from the tutorial.

The code works and extracts the data I need, but the output is laid out wrong.

The code is as follows:

import urllib2  # the code below calls urllib2.build_opener, so the module itself must be imported
import re
from cookielib import CookieJar

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent','Mozilla/5.0')]


def main():
    try:
        page = 'http://thewebsite/feed/rs/rss.xml'
        sourceCode = opener.open(page).read()

        try:
            titles = re.findall(r'<title>(.*?)</title>',sourceCode)
            desc = re.findall(r'<description>(.*?)</description>',sourceCode)
            links = re.findall(r'<link>(.*?)</link>',sourceCode)
            pub = re.findall(r'<pubDate>(.*?)</pubDate>',sourceCode)


            for title in titles:
                print title

            for description in desc:
                print description

            for link in links:
                print link

            for pubDate in pub:
                print pubDate

        except Exception, e:
            print str(e)


    except Exception, e:
        print str(e)

main()

The problem is that when I run the code I see a long list of titles, then a list of descriptions, then a list of links, then a list of dates.

Like this:

The Title
The Title
The Title
the description
the description
the description
the link
the link
the link
the date
the date
the date

What I really want is output like this:

The Title
the description
the link
the date

The Title
the description
the link
the date

The Title
the description
the link
the date

Can anyone help with correcting the output?

I really want to just grab headlines from the RSS feed and output them to the shell.

The purpose is to use this on a Raspberry Pi instead of using the web browser.

Any help is much appreciated.

**UPDATE**

Also, the RSS feed I am parsing has extra tags at the beginning, including elements called title and link. Is there any way these can be excluded?

The feed starts with this:

**** DON'T WANT THIS. HAS TAGS I DON'T WANT TO USE

    <RSS>
      <channel>
        <title>DBE News</title>
        <link>http://www.BDE.co.uk/news//#sa-ns_mchannel=rss&amp;ns_source=PublicRSS20-sa</link>
        <description>The latest stories from the BDE News web site.</description>
        <language>en-gb</language>
        <lastBuildDate>Fri, 17 Jan 2014 22:11:54 GMT</lastBuildDate>
        <copyright>Copyright: DBE, see http://DBE.com for terms and conditions of reuse.</copyright>
        <image>
          <url>http://news.DBE.co.uk/nol/shared/img/DBE.gif</url>
          <title>DBE News</title>
          <link>http://www.DBE.co.uk/news/uk/#sa-ns_mchannel=rss&amp;ns_source=PublicRSS20-sa</link>
          <width>120</width>
          <height>60</height>
        </image>
        <ttl>15</ttl>
        <atom:link href="http://feeds.DBE.co.uk/news/uk/rss.xml" rel="self" type="application/rss+xml"/>

**** DON'T WANT THIS. HAS TAGS I DON'T WANT TO USE

**** NORMAL FEED STARTS HERE. WANT ALL THE INFO HERE

    <item>
      <title>Stackover flow is cool</title>
      <description>Get the answers here.</description>
      <link>http://www.DBE.co.uk/news/Stsck#sa-ns_mchannel=rss&amp;ns_source=PublicRSS20-sa</link>
      <guid isPermaLink="false">http://www.DBE.co.uk/news/uStack</guid>
      <pubDate>Fri, 17 Jan 2014 21:38:05 GMT</pubDate>
      <media:thumbnail width="66" height="49" url="http://news.DBEimg.co.uk/media/images/72364000/jpg/_72364484_mikaeel.jpg"/>
      <media:thumbnail width="144" height="81" url="http://news.DBEimg.co.uk/media/images/72364000/jpg/_72364485_mikaeel.jpg"/>
    </item>
    <item>

The extra tags cause a 'list index out of range' error.

You could parse the xml as data and do something like: http://stackoverflow.com/questions/2148119/how-to-convert-an-xml-string-to-a-dictionary-in-python — brandonscript, Jan 17 '14 at 21:50
I wrote some code to do this at first, but it returned no data. The same code worked on another XML file I had downloaded using Python. — Zeeman, Jan 17 '14 at 21:57
[It's not a good idea to use regular expressions to parse xml](http://stackoverflow.com/questions/21196624/hide-html-code-temporarily-and-display-them-later/21196742#21196742); you'd be best to work out a proper decoding like that and implement a dict/enumerator or do what @jramirez suggests afterwards. — brandonscript, Jan 17 '14 at 22:03

jramirez · Answer 1 · 2014-01-17T21:59:39.353

You could do something like this. Note that this only works if all four lists always contain the same number of elements. If, say, there are six elements in titles and three in desc, you will get an IndexError exception.

        for i,title in enumerate(titles):
            print title
            print desc[i]
            print links[i]
            print pub[i]
            print ""
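
If the lists might differ in length, one alternative is zip, which stops at the shortest list instead of raising IndexError. This is only a sketch: the sample values and the helper name group_entries are made up for illustration, and print is called with a single argument so the snippet runs on both Python 2.7 and Python 3. Keep in mind that zip silently drops unmatched trailing elements, and it does not fix misalignment caused by extra channel-level tags.

```python
# Made-up sample data standing in for the lists built by re.findall.
titles = ['Title A', 'Title B', 'Title C']
desc = ['Description A', 'Description B']
links = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c']
pub = ['Fri, 17 Jan 2014 21:38:05 GMT', 'Fri, 17 Jan 2014 21:40:00 GMT']


def group_entries(titles, desc, links, pub):
    """Pair up the four lists entry by entry (hypothetical helper).

    zip stops at the shortest list, so no IndexError is raised.
    """
    return list(zip(titles, desc, links, pub))


for title, description, link, pub_date in group_entries(titles, desc, links, pub):
    print(title)
    print(description)
    print(link)
    print(pub_date)
    print("")  # blank line between entries
```

With the sample data above, only two entries are printed, because desc and pub each hold two elements.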

edited Jan 17 '14 at 21:59

answered Jan 17 '14 at 21:49


score 0 · Accepted Answer · answered Jan 17 '14 at 21:49

Try replacing your four for loops with this:

for i in range(len(titles)):
    print titles[i]
    print desc[i]
    print links[i]
    print pub[i]
    print ""
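
For the extra title and link tags at the top of the feed, one option (along the lines suggested in the comments) is to parse the feed as XML and read only the item elements, so the channel-level tags never end up in the output. Below is a minimal sketch using the standard library's xml.etree.ElementTree. The SAMPLE_FEED string and the parse_items helper are assumptions for illustration, not the real feed. With the real feed you would pass the string returned by opener.open(page).read() instead, assuming it is well-formed XML with its namespaces declared.

```python
import xml.etree.ElementTree as ET

# Made-up feed: channel-level title/link tags followed by two items.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Channel title (not wanted)</title>
    <link>http://example.com/</link>
    <description>Channel description (not wanted)</description>
    <image>
      <title>Image title (not wanted)</title>
      <link>http://example.com/</link>
    </image>
    <item>
      <title>First headline</title>
      <description>First description</description>
      <link>http://example.com/first</link>
      <pubDate>Fri, 17 Jan 2014 21:38:05 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <description>Second description</description>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""


def parse_items(xml_text):
    """Return one dict per <item>, ignoring channel-level tags (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter('item'):
        # findtext only looks at direct children of this <item>;
        # a missing tag becomes an empty string instead of an error.
        entries.append({
            'title': item.findtext('title', default=''),
            'description': item.findtext('description', default=''),
            'link': item.findtext('link', default=''),
            'pubDate': item.findtext('pubDate', default=''),
        })
    return entries


for entry in parse_items(SAMPLE_FEED):
    print(entry['title'])
    print(entry['description'])
    print(entry['link'])
    print(entry['pubDate'])
    print("")  # blank line between entries
```

Because each entry is built from a single item element, the four fields always belong together, and an item that lacks a tag (the second item above has no pubDate) prints an empty line rather than causing a 'list index out of range' error.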
