Stripping content using regular expressions in Python

Question

I'm trying to use only the re module to extract text from an RSS feed. So far I've extracted the descriptions using findall, but I don't know where to go from here. This is what I've written:

import re
from urllib2 import urlopen

url = 'http://www.theguardian.com/sport/rss'
open_page = urlopen(url)
html_code = open_page.read()
open_page.close()

descriptions = re.findall(r'<description>(.*?)</description>',html_code)

for description in descriptions:
    if 'Latest news and features from theguardian.com' in description:
        pass
    else:
        print "Description:" ,description

This code gives the following output:

Description: Wales 0-0 Bosnia-Herzegovina&lt;p&gt;It was not &lt;a href="http://www.theguardian.com/football/2014/oct/09/wales-bosnia-chris-coleman-euro-2016-qualifier" title=""&gt;the victory that Chris Coleman, his players and the home supporters craved&lt;/a&gt; to ignite hopes of qualifying for the European Championships in France but this may well turn out to be a precious point for Wales. Ashley Williams and Hal Robson-Kanu will have sleepless nights about the glorious chances they squandered late on but at the other end of the pitch it was impossible to overlook the outstanding contribution Wayne Hennessey made in goal.&lt;/p&gt;&lt;p&gt;Unable to get into the Crystal Palace team at the moment, Hennessey produced half a dozen crucial stops here, including a triple save early in the second half and  perhaps most memorably of all  flicked Miralem Pjanics 30-yard free-kick over the bar eight minutes from time, when the Bosnia playmaker looked to have found the top corner.&lt;/p&gt; &lt;a href="http://www.theguardian.com/football/2014/oct/10/wales-bosnia-herzegovina-euro-2016-qualifying"&gt;Continue reading...&lt;/a&gt;

I was wondering what regular expressions I could use to take all the tags out of this and leave plain text (a few sentences at most). Can anyone help me out?

Also, I understand it would be easier to use BeautifulSoup or HTMLParser, but I'm just trying to use re.

Just a few sentences from the text (a description). e.g. "Ashley Williams and Hal Robson-Kanu will have sleepless nights about the glorious chances they squandered late on but at the other end of the pitch it was impossible to overlook the outstanding contribution Wayne Hennessey made in goal." — user2747367, Oct 11 '14 at 06:20

score 1 · Answer 1 · answered Oct 11 '14 at 06:22

The problem is that there is HTML code inside every description tag.

Here is how you can find all description tags using BeautifulSoup, load them into separate BeautifulSoup objects and get the text:

from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://www.theguardian.com/sport/rss'
soup = BeautifulSoup(urlopen(url))

for description in soup.find_all('description'):
    print BeautifulSoup(description.text).text

Prints:

Latest news and features from theguardian.com, the world's leading liberal voice
Raheem Sterling and Calum Chambers making senior mark Players dont reach their best until theyre 27 or 28 Euro 2016 qualifier match report: England 5-0 San MarinoRoy Hodgson has admitted his successor as England manager may be the chief beneficiary of the crop of young players already making their mark in the senior team as the national set-up makes plans beyond the 2016 European Championships.The squad travel to Estonia on Saturday before their latest qualifying game having established themselves at the top of Group E and with a number of bright young things seizing their opportunity to establish credentials at the higher level. The team will be tested sternly in prestigious friendly fixtures over the next two years, with Italy confirmed as opponents next March, likely to be played in Turin, and negotiations close to conclusion to play France at the Stade de France, either in November 2015 or the March before the tournament. Continue reading...
...
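The same idea (treat the contents of each description as a separate HTML fragment) can also be sketched with the standard library only. This is a rough Python 3 illustration, not the answer's code: the class name `TextExtractor`, the helper `description_text`, and the sample string are made up for the example, and it assumes the description arrives entity-escaped as in the question.

```python
from html import unescape
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are ignored.
        self.parts.append(data)


def description_text(escaped):
    """Turn an entity-escaped HTML fragment into plain text."""
    parser = TextExtractor()
    # First undo the escaping (&lt;p&gt; becomes <p>), then parse the HTML.
    parser.feed(unescape(escaped))
    parser.close()
    return "".join(parser.parts)


# Made-up sample in the same shape as the feed's descriptions.
sample = ('Wales 0-0&lt;p&gt;A precious point for '
          '&lt;a href="#"&gt;Wales&lt;/a&gt;.&lt;/p&gt;')
print(description_text(sample))
```

Like the BeautifulSoup version, this joins the text nodes without separators, so text from adjacent elements runs together.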

score 1 · Accepted Answer · nu11p01n73R · 2014-10-11T08:52:21.763

Your regex is fine. All you need to do is get rid of all the tags within each description as well. The re.sub function can help with this:

>>> import re
>>> re.sub("<.*?>", "", "<h1>heading</h1>")
'heading'

Here <.*?> matches any HTML tag, and re.sub replaces each match with "" (the empty string).
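A small sketch of why the `?` in the pattern matters, written for Python 3 (the thread's own code is Python 2). The input string and the variable names are made up for illustration.

```python
import re

# Hypothetical sample input, not taken from the feed.
sample = "<h1>heading</h1><p>body</p>"

# Non-greedy: each match stops at the first ">", so only the tags are removed.
non_greedy = re.sub(r"<.*?>", "", sample)

# Greedy: a single match runs from the first "<" to the last ">",
# swallowing the text in between as well.
greedy = re.sub(r"<.*>", "", sample)

print(repr(non_greedy))  # 'headingbody'
print(repr(greedy))      # ''
```

Note that the text of adjacent elements is joined with nothing between the pieces, which is why "heading" and "body" run together.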

The code can be edited as follows:

import re
from urllib2 import urlopen

url = 'http://www.theguardian.com/sport/rss'
open_page = urlopen(url)
html_code = open_page.read()
open_page.close()

descriptions = re.findall(r'<description>(.*?)</description>',html_code)

for description in descriptions:
    if 'Latest news and features from theguardian.com' in description:
        pass
    else:
        # edited here: the tags are escaped in the feed, so match &lt ... &gt
        cont = re.sub("&lt.*?&gt","",description)

        print "Description:" ,cont

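Because the embedded HTML is stored in the feed in escaped form (&lt; for < and &gt; for >), the pattern has to match the escaped tags: cont = re.sub("&lt.*?&gt","",description). Note that this pattern stops just before the ; that ends each &gt;, so stray semicolons are left in the text; the pattern "&lt;.*?&gt;" removes them too.

Once the stray semicolons are removed as well, this produces output like: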

    Description: Wales 0-0 Bosnia-HerzegovinaIt was not the victory that Chris Coleman, his players and the home 
supporters craved to ignite hopes of qualifying for the European Championships in France but this may well turn out to 
be a precious point for Wales. Ashley Williams and Hal Robson-Kanu will have sleepless nights about the glorious chances 
they squandered late on but at the other end of the pitch it was impossible to overlook the outstanding contribution 
Wayne Hennessey made in goal.Unable to get into the Crystal Palace team at the moment, Hennessey produced half a dozen 
crucial stops here, including a triple save early in the second half and perhaps most memorably of all flicked Miralem 
Pjanics 30-yard free-kick over the bar eight minutes from time, when the Bosnia playmaker looked to have found the top 
corner. Continue reading...
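Putting the pieces together with `re` alone, here is a rough Python 3 sketch. The helper names `strip_escaped_tags` and `first_sentences`, the sample string, and the choice to replace each tag with a space (so that words from adjacent elements do not run together) are assumptions for illustration, not part of the answer above.

```python
import re


def strip_escaped_tags(description):
    """Remove entity-escaped tags such as &lt;p&gt; from a description."""
    # Include the trailing ";" of each entity so no stray semicolons remain.
    text = re.sub(r"&lt;.*?&gt;", " ", description, flags=re.DOTALL)
    # Collapse the runs of whitespace left behind by the removed tags.
    return re.sub(r"\s+", " ", text).strip()


def first_sentences(text, count):
    """Keep only the first `count` sentences (split after ., ! or ?)."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(sentences[:count])


# Made-up sample in the same shape as the feed's descriptions.
sample = ('Wales 0-0 Bosnia-Herzegovina&lt;p&gt;It was not '
          '&lt;a href="http://example.com/" title=""&gt;the victory'
          '&lt;/a&gt; they craved.&lt;/p&gt;&lt;p&gt;It may still be a '
          'precious point. Hennessey was outstanding.&lt;/p&gt;')

plain = strip_escaped_tags(sample)
print(first_sentences(plain, 2))
```

The sentence splitter is deliberately naive: it treats every `.`, `!` or `?` followed by whitespace as a sentence end, so abbreviations would be split too.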

edited Oct 11 '14 at 08:52

answered Oct 11 '14 at 06:37

nu11p01n73R

I tried using this solution but got the following error: File "C:\Python27\lib\re.py", line 151, in sub return _compile(pattern, flags).sub(repl, string, count) TypeError: expected string or buffer – user2747367 Oct 11 '14 at 07:06
It's because you passed `count` to `re.sub`, which is not a string. In my example I used `content`, which is the string returned by `findall`. – nu11p01n73R Oct 11 '14 at 07:11
The problem I'm having is that when I print the text in Python, the `<p>` tags and brackets all come out as "p&gt;" and "&lt;". – user2747367 Oct 11 '14 at 08:09
But in your question it all prints correctly as `<p>`. I haven't tested the `urlopen` path, only the `re.sub` match, which performs correctly. – nu11p01n73R Oct 11 '14 at 08:33
I think that's because it converts the tags back when I post it here. Any idea how I can fix this? – user2747367 Oct 11 '14 at 08:34
I'm not sure how to convert them back, but there is a workaround: match `&lt` instead of `<`. I have edited the answer. – nu11p01n73R Oct 11 '14 at 08:53
That looks amazing, thank you! The only remaining problem is that there are a lot of semicolons, which would ideally be replaced with full stops. Is there any way to substitute these? Here is sample text from the most recent printout: "Raheem Sterling and Calum Chambers making senior mark; Players dont reach their best until theyre 27 or 28; ;Euro 2016 qualifier match report: England 5-0 San Marino;;" – user2747367 Oct 11 '14 at 09:25
Simply use the `replace` method to replace them with an empty string. Before printing, insert the line `cont = cont.replace(';','')`, which removes every `;` (the result has to be assigned back, because `replace` returns a new string). Also, if you find the answer useful, kindly accept it so that it may be useful to others as well. Thank you. – nu11p01n73R Oct 11 '14 at 09:34
@user2747367 sorry, but regex is really a bad way to parse HTML, see http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. – alecxe Oct 11 '14 at 16:05

score 0 · Answer 3 · answered Oct 11 '14 at 08:55

<[^>]*>

Try this. You can use re.sub and replace each match with an empty string. See the demo:

http://regex101.com/r/vR4fY4/9
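One practical difference between this pattern and `<.*?>` shows up when a tag spans more than one line. The following Python 3 sketch uses a made-up input string with literal `<` and `>`; on the entity-escaped feed text from the question, neither pattern would match until the entities are handled.

```python
import re

# Hypothetical input with an opening tag split across two lines.
sample = '<a href="http://example.com/"\n   title="">link</a> text'

# "." does not match a newline by default, so the opening tag survives.
dot_version = re.sub(r"<.*?>", "", sample)

# The negated class [^>] matches any character except ">", newlines included.
class_version = re.sub(r"<[^>]*>", "", sample)

print(repr(dot_version))
print(repr(class_version))  # 'link text'
```

Passing `flags=re.DOTALL` to the `<.*?>` version would make it behave like the character-class version here.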

answered Oct 11 '14 at 08:55

vks
