2

I have a regular expression that searches an RSS weather feed. The feed is fetched like this:

from urllib import urlopen  # Python 2

url = 'http://rss.weatherzone.com.au/?u=12994-1285&lt=aploc&lc=9388&obs=1&fc=1&warn=1'
weather_brisbane = urlopen(url)
html_code = weather_brisbane.read()
weather_brisbane.close()

I have a regex:

from re import findall

weather_contents = findall('<b>(.+)</b> (.*)', html_code)
if weather_contents:
    print 'Contents'
    for section_heading in weather_contents:
        print section_heading
    print

I get this as a result:

Contents
('Temperature:', '20.1&#176;C\r')
('Feels like:', '20.1&#176;C<br />\r')
('Dew point:', '13.6&#176;C\r')
('Relative humidity:', '66%<br />\r')
('Wind:', 'E at 2 km/h, gusting to 4 km/h\r')
('Rain:', '0.0mm since 9am<br />\r')
('Pressure:', '1024.9 hPa\r')

So my question is, is there a way to get this result:

Contents
Temperature: 20.1
Feels like: 20.1
Dew point: 13.6
Relative humidity: 66%
Wind: E at 2 km/h, gusting to 4 km/h
Rain: 0.0mm since 9am
Pressure: 1024.9 hPa

Ideally I would like to do this by adding a strip() call to the existing code.

  • What exactly have you tried so far? SO is all about self-research, not about others doing your coding for you. – NirMH May 19 '14 at 08:50
  • Well, to be honest, I'm quite stumped and I have no idea how to refine my result. I am planning to make a GUI for weather, but I cannot get rid of this extra content. I read that strip() is used for removing things from strings, but I'm not very familiar with how to use it. – user3651791 May 19 '14 at 10:20
  • `<b>(.+)</b> (.*)` Please don't :/ What happens when you have multiple `<b>` tags on one line? See the many related questions about parsing HTML with regex (discussion [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags), for example). It can be done in very limited, well-defined cases (though you would need something stronger than your current pattern), but since you need special trimming anyway, you might as well use a parser. – Robin May 19 '14 at 10:39

3 Answers

1

The output you are getting seems to be HTML-encoded.

An HTML decoder will fix that; see "Decode HTML entities in Python string?"

So use this code:

from re import findall
from HTMLParser import HTMLParser  # Python 2 module name

h = HTMLParser()
weather_contents = findall('<b>(.+)</b> (.*)', html_code)
if weather_contents:
    print 'Contents'
    for section_heading in weather_contents:
        # unescape() decodes entities such as &#176;; a trailing <br /> is left as-is
        print section_heading[0], h.unescape(section_heading[1])
    print

I think this will display what you want to display.
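For readers on Python 3, a minimal sketch of the same idea is below. It assumes `html.unescape` (Python 3.4+) as the decoder and additionally drops the trailing `<br />` and `\r` that decoding alone leaves behind. The `sample` string is reconstructed from the output shown in the question rather than taken from the live feed, and `clean_value` and `parse_weather` are hypothetical helper names.

```python
import re
from html import unescape

# Sample reconstructed from the output in the question (an assumption;
# the live feed may look different).
sample = (
    "<b>Temperature:</b> 20.1&#176;C\r\n"
    "<b>Feels like:</b> 20.1&#176;C<br />\r\n"
    "<b>Relative humidity:</b> 66%<br />\r\n"
)


def clean_value(raw):
    """Decode entities, then drop a trailing <br /> and surrounding whitespace."""
    text = unescape(raw).strip()  # decodes &#176; and removes the trailing '\r'
    if text.endswith("<br />"):
        text = text[: -len("<br />")].rstrip()
    return text


def parse_weather(html_code):
    """Return (heading, cleaned value) pairs using the regex from the question."""
    pairs = re.findall(r"<b>(.+)</b> (.*)", html_code)
    return [(heading, clean_value(value)) for heading, value in pairs]


for heading, value in parse_weather(sample):
    print(heading, value)
```

This keeps the degree sign and unit in the value; if only the number is wanted, the unit would still need to be removed separately.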

David Mabodo
  • Might want to use the HTML parser to, well, parse HTML with it, instead of using the original regex :) – Robin May 19 '14 at 10:40
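As a rough illustration of what the comment suggests, here is a sketch that lets a parser do the work instead of the regex. It is written for Python 3 (`html.parser`), and it assumes the HTML fragment has already been pulled out of the RSS feed; `WeatherParser` and `parse_pairs` are hypothetical names, and the sample input is made up to match the shape of the output in the question.

```python
from html.parser import HTMLParser


class WeatherParser(HTMLParser):
    """Collect (heading, value) pairs: text inside <b> is the heading,
    the first non-blank text after </b> is its value."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#176; in text data
        super().__init__(convert_charrefs=True)
        self.pairs = []
        self._in_bold = False
        self._heading = None

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._in_bold = True
            self._heading = ""

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_bold = False

    def handle_data(self, data):
        if self._in_bold:
            self._heading += data
        elif self._heading is not None and data.strip():
            # strip() removes the leading space and any trailing '\r\n'
            self.pairs.append((self._heading, data.strip()))
            self._heading = None


def parse_pairs(fragment):
    parser = WeatherParser()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered at the end
    return parser.pairs


# Two <b> tags on one line, the case the regex handles badly.
print(parse_pairs("<b>Temperature:</b> 20.1&#176;C <b>Feels like:</b> 20.1&#176;C<br />\r\n"))
```

Because the parser reports `<br />` as a tag rather than as text, no separate trimming of it is needed in this sketch.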
1

There is an alternative to HTMLParser:

print ' '.join([s.rstrip('\r').rsplit('<br />')[0].rsplit('&#176;C')[0] for s in section_heading])

instead of

print section_heading
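Since the question asks about strip() specifically, here is a small Python 3 sketch of how it behaves on these values. One thing to watch: the argument to `strip()`/`rstrip()` is treated as a set of characters, not as a substring, so removing a literal `<br />` is safer with an explicit suffix check. `strip_suffix` and `clean` are hypothetical helper names, and the rows are copied from the output in the question.

```python
def strip_suffix(text, suffix):
    """Remove `suffix` once from the end of `text`, if it is present."""
    if suffix and text.endswith(suffix):
        return text[: -len(suffix)]
    return text


def clean(value):
    value = value.rstrip()                          # trailing whitespace, including '\r'
    value = strip_suffix(value, "<br />").rstrip()  # literal tag, not a character set
    value = strip_suffix(value, "&#176;C")          # drop the encoded degree unit
    return value


# Pitfall: rstrip() with a multi-character argument strips any of those characters.
print("number<br />".rstrip("<br />"))  # prints 'numbe', the final 'r' is lost too

rows = [
    ("Temperature:", "20.1&#176;C\r"),
    ("Relative humidity:", "66%<br />\r"),
    ("Rain:", "0.0mm since 9am<br />\r"),
]
for heading, value in rows:
    print(heading, clean(value))
```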
Vasily Ryabov
0
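weather_contents = [(heading, value.replace('&#176;C', "C")) for heading, value in weather_contents]

This should help refine your weather_contents. Note that findall returns a list of tuples here, so replace() is applied to the value in each tuple.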

Zenziba