I'm trying to use only the re module to extract text from an RSS feed. So far I've extracted the descriptions using findall, but I don't know where to go from here. This is what I've written:
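import re
from urllib2 import urlopen  # Python 2; in Python 3 this is urllib.request.urlopen

url = 'http://www.theguardian.com/sport/rss'
open_page = urlopen(url)
html_code = open_page.read()
open_page.close()

# Capture the contents of every <description> element (non-greedy match)
descriptions = re.findall(r'<description>(.*?)</description>', html_code)
for description in descriptions:
    # Skip the feed-level description and print only the article descriptions
    if 'Latest news and features from theguardian.com' not in description:
        print "Description:", description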
This code gives the following output:
Description: Wales 0-0 Bosnia-Herzegovina<p>It was not <a href="http://www.theguardian.com/football/2014/oct/09/wales-bosnia-chris-coleman-euro-2016-qualifier" title="">the victory that Chris Coleman, his players and the home supporters craved</a> to ignite hopes of qualifying for the European Championships in France but this may well turn out to be a precious point for Wales. Ashley Williams and Hal Robson-Kanu will have sleepless nights about the glorious chances they squandered late on but at the other end of the pitch it was impossible to overlook the outstanding contribution Wayne Hennessey made in goal.</p><p>Unable to get into the Crystal Palace team at the moment, Hennessey produced half a dozen crucial stops here, including a triple save early in the second half and perhaps most memorably of all flicked Miralem Pjanics 30-yard free-kick over the bar eight minutes from time, when the Bosnia playmaker looked to have found the top corner.</p> <a href="http://www.theguardian.com/football/2014/oct/10/wales-bosnia-herzegovina-euro-2016-qualifying">Continue reading...</a>
I was wondering what regular expressions I could use to strip all the tags out of this and leave plain text (a few sentences at most). Can anyone help me out?
Also, I understand it would be easier to use BeautifulSoup or HTMLParser, but I'm just trying to use re.
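To make the goal concrete, here is a rough sketch of the kind of re-only approach I have in mind. It is an assumption about what might work, not a robust HTML parser: the helper names strip_tags and first_sentences are made up for illustration, the sample string is a shortened stand-in for a real description with placeholder example.com links, and the patterns will misbehave on things like a literal ">" inside an attribute value.

```python
import re

# Hypothetical patterns and helper names, for illustration only.
BLOCK_TAG = re.compile(r'</?(?:p|br|div)\b[^>]*>', re.IGNORECASE)
ANY_TAG = re.compile(r'<[^>]+>')
SENTENCE_END = re.compile(r'(?<=[.!?])\s+')


def strip_tags(html):
    """Remove tags, keeping a space where a block-level tag separated text."""
    text = BLOCK_TAG.sub(' ', html)   # block tags become a space so words don't fuse
    text = ANY_TAG.sub('', text)      # remaining (inline) tags are dropped outright
    return re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace


def first_sentences(text, count=2):
    """Return roughly the first `count` sentences of already-cleaned text."""
    return ' '.join(SENTENCE_END.split(text)[:count])


# Shortened, made-up sample in the same shape as the feed output above.
sample = ('Wales 0-0 Bosnia-Herzegovina<p>It was not <a href="http://example.com/" '
          'title="">the victory</a> they craved. A second sentence.</p>'
          '<p>A third one.</p> <a href="http://example.com/more">Continue reading...</a>')
print(first_sentences(strip_tags(sample)))
```

The sentence split is deliberately naive (it breaks after ".", "!" or "?" followed by whitespace), so abbreviations would cut a sentence short, and the trailing "Continue reading..." link text would still need to be filtered out separately.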