Removing part of a string that has been obtained by a regular expression using strip functions

Question

I am having a problem. I have a regular expression that looks through an RSS feed for weather data:

from urllib import urlopen  # Python 2; in Python 3 use urllib.request.urlopen

url = 'http://rss.weatherzone.com.au/?u=12994-1285&lt=aploc&lc=9388&obs=1&fc=1&warn=1'
weather_brisbane = urlopen(url)
html_code = weather_brisbane.read()
weather_brisbane.close()

I have a regex:

from re import findall

weather_contents = findall('<b>(.+)</b> (.*)', html_code)
if weather_contents:  # non-empty list of (heading, value) tuples
    print 'Contents'
    for section_heading in weather_contents:
        print section_heading
    print

I get this as a result:

Contents
('Temperature:', '20.1&#176;C\r')
('Feels like:', '20.1&#176;C<br />\r')
('Dew point:', '13.6&#176;C\r')
('Relative humidity:', '66%<br />\r')
('Wind:', 'E at 2 km/h, gusting to 4 km/h\r')
('Rain:', '0.0mm since 9am<br />\r')
('Pressure:', '1024.9 hPa\r')

So my question is, is there a way to get this result:

Contents
Temperature: 20.1
Feels like: 20.1
Dew point: 13.6
Relative humidity: 66%
Wind: E at 2 km/h, gusting to 4 km/h
Rain: 0.0mm since 9am
Pressure: 1024.9 hPa

Ideally this would be done by integrating a strip() call into the existing code.

what exactly have you tried so far? SO is all about self-research... not about others doing your coding for you — NirMH, May 19 '14 at 08:50
Well, to be honest I'm quite stumped and I have no idea how to refine my result. I am planning to make a GUI for weather but I cannot get rid of this extra content. I looked up that strip() is used for taking things out of strings, but I'm not very familiar with how to use it. — user3651791, May 19 '14 at 10:20
`<b>(.+)</b> (.*)` Please don't :/ What happens when you have multiple `<b>` tags on one line? See the many related questions on HTML parsing with regex (discussion [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) for example): it can be done in very limited, well-defined cases (but you need something stronger than your current one), but since you need special trimming you might as well use a parser. — Robin, May 19 '14 at 10:39

score 1 · Answer 1 · edited May 23 '17 at 12:06

The output you are getting seems to be HTML-encoded.

An HTML decoder will fix that; see "Decode HTML entities in Python string?".

So use this code:

from HTMLParser import HTMLParser
h = HTMLParser()
weather_contents = findall('<b>(.+)</b> (.*)', html_code)
if weather_contents != []:
    print 'Contents'
    for section_heading in weather_contents:
        print section_heading[0], h.unescape(section_heading[1]) 
    print

I think this will display what you want to display.
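Note that unescaping alone turns `&#176;` into the degree sign but leaves the trailing `<br />` and `\r` in place. Below is a minimal sketch of combining both steps. It is written for Python 3, where `html.unescape` takes the place of `HTMLParser().unescape`. The helper name `clean_value` and the hard-coded sample list are illustrative assumptions, not part of the original code.

```python
import re
from html import unescape

# Sample tuples copied from the question's output (a stand-in for findall's result).
weather_contents = [
    ('Temperature:', '20.1&#176;C\r'),
    ('Feels like:', '20.1&#176;C<br />\r'),
    ('Relative humidity:', '66%<br />\r'),
]

def clean_value(raw):
    """Decode HTML entities, drop <br> tags, and trim surrounding whitespace."""
    text = unescape(raw)                    # '&#176;' becomes the degree sign
    text = re.sub(r'<br\s*/?>', '', text)   # remove <br>, <br/> or <br />
    return text.strip()                     # strip() removes '\r' and spaces at both ends

for heading, value in weather_contents:
    print(heading, clean_value(value))
```

If the unit should go as well, as in the question's desired output, a further `.replace('\u00b0C', '')` on the cleaned text would be one way to do it.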

answered May 19 '14 at 08:36 by David Mabodo (edited May 23 '17 at 12:06 by Community)

Might want to use the HTML parser to, well, parse HTML with it, instead of using the original regex :) – Robin May 19 '14 at 10:40
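One way to follow this suggestion is sketched below with Python 3's `html.parser.HTMLParser`. The class name `BoldPairParser` and the sample markup are assumptions made for illustration: the real feed is only known here through the regex output in the question, so the markup is reconstructed from it.

```python
from html.parser import HTMLParser

class BoldPairParser(HTMLParser):
    """Collect (heading, value) pairs from '<b>heading</b> value' markup."""

    def __init__(self):
        super().__init__()          # convert_charrefs=True decodes '&#176;' for us
        self.pairs = []             # finished (heading, value) tuples
        self._heading = None        # heading text of the pair being built
        self._value = []            # text chunks seen after the closing </b>
        self._in_bold = False

    def _flush(self):
        # Store the pair in progress, if any, and reset the state.
        if self._heading is not None:
            self.pairs.append((self._heading, ''.join(self._value).strip()))
        self._heading, self._value = None, []

    def handle_starttag(self, tag, attrs):
        if tag == 'b':
            self._flush()           # a new heading ends the previous value
            self._heading = ''
            self._in_bold = True
        elif tag == 'br':
            self._flush()           # a line break also ends the value

    def handle_endtag(self, tag):
        if tag == 'b':
            self._in_bold = False

    def handle_data(self, data):
        if self._in_bold:
            self._heading += data
        elif self._heading is not None:
            self._value.append(data)

    def close(self):
        super().close()
        self._flush()               # keep a trailing pair with no <br /> after it

# Assumed sample markup, reconstructed from the question's output.
sample = ('<b>Temperature:</b> 20.1&#176;C\r\n'
          '<b>Feels like:</b> 20.1&#176;C<br />\r\n'
          '<b>Wind:</b> E at 2 km/h, gusting to 4 km/h\r\n')

parser = BoldPairParser()
parser.feed(sample)
parser.close()
for heading, value in parser.pairs:
    print(heading, value)
```

Because the parser sees tags and text separately, several `<b>` tags on one line are handled the same way as one per line, and no regex-specific trimming of `<br />` is needed.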

score 1 · Accepted Answer · answered May 19 '14 at 10:55

There is an alternative to HTMLParser:

print ' '.join([s.rstrip('\r').rsplit('<br />')[0].rsplit('&#176;C')[0] for s in section_heading])

instead of

print section_heading
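For reference, `strip()` and `rstrip()` treat their argument as a set of characters to trim from the ends, not as a substring, which is presumably why the one-liner uses `rsplit` for `<br />` and `&#176;C`. The sketch below wraps the same chain in a helper. The name `trim_value` and the `'Clear<br />'` string are hypothetical, and the code uses Python 3 `print()` rather than the Python 2 statement used in the thread.

```python
def trim_value(raw):
    """Apply the same rstrip/rsplit chain as the one-liner to a single string."""
    text = raw.rstrip('\r')              # drop the trailing carriage return
    text = text.rsplit('<br />')[0]      # keep what precedes the first '<br />'
    return text.rsplit('&#176;C')[0]     # keep what precedes the first '&#176;C'

# rstrip with a multi-character argument trims any of those characters,
# so it also eats the final 'r' of the word itself:
over_stripped = 'Clear<br />'.rstrip('<br />')
print(repr(over_stripped))

# Sample tuples copied from the question's output.
for pair in [('Temperature:', '20.1&#176;C\r'), ('Rain:', '0.0mm since 9am<br />\r')]:
    print(' '.join(trim_value(s) for s in pair))
```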

answered May 19 '14 at 10:55 by Vasily Ryabov

score 0 · Answer 3 · answered May 19 '14 at 08:39

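weather_contents = [(heading, value.replace('&#176;C', 'C')) for heading, value in weather_contents]

This should help refine your weather_contents. Each item is a (heading, value) tuple, so replace() has to be applied to the string inside it rather than to the tuple itself.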

answered May 19 '14 at 08:39 by Zenziba
