Edit: With the kind help of the answers below, I came to realize that parsing HTML with regex is generally a bad idea. For what it's worth, if someone else comes across this post someday with the same question, here are two similar questions on the topic, with far more debate and explanation that you might find useful: "Using regular expressions to parse HTML: why not?" and "RegEx match open tags except XHTML self-contained tags".
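For later readers, here is a minimal sketch of the parser-based approach, using `html.parser` from the standard library. The class name `TemperatureParser` and the `sample_html` string are my own illustrative choices (the string is a trimmed stand-in for the real page, with the degree sign written as `&deg;`), so treat this as one possible way to do it rather than the definitive one.

```python
from html.parser import HTMLParser


class TemperatureParser(HTMLParser):
    """Collects the first piece of text inside each <div class="wx-temperature">."""

    def __init__(self):
        super().__init__()
        self.expecting_value = False  # True right after a matching <div> opens
        self.temperatures = []

    def handle_starttag(self, tag, attrs):
        # Exact class match, so "wx-temperature-label" is not picked up.
        if tag == "div" and dict(attrs).get("class") == "wx-temperature":
            self.expecting_value = True

    def handle_data(self, data):
        text = data.strip()
        if self.expecting_value and text:
            self.temperatures.append(text)
            self.expecting_value = False


# Hypothetical stand-in for the downloaded page source.
sample_html = (
    '<div class="wx-temperature"><span itemprop="temperature-fahrenheit">89</span>'
    '<span class="wx-degrees">&deg;</span></div>'
    '<div class="wx-temperature">96<span class="wx-degrees">&deg;</span></div>'
    '<div class="wx-temperature-label">HIGH AT 4:45 PM</div>'
    '<div class="wx-temperature">80<span class="wx-degrees">&deg;</span></div>'
    '<div class="wx-temperature-label">LOW</div>'
)

parser = TemperatureParser()
parser.feed(sample_html)
parser.close()
print(parser.temperatures)
```

The parser reacts to tags rather than to exact character sequences, so it should keep working if attributes are reordered or whitespace changes, which is where the regex version tends to break.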
Specs: Python 3.3.1
What I was trying to do: I was writing a web page extractor to grab weather data from a website, which for my project has 3 meaningful sections: the temperature "Right Now", "Earlier Today" and "Tonight". I intend to grab these 3 numbers only and leave out all other text. In the code below I used the specific HTML elements that precede each temperature number as a pattern to help me grab the number itself.
All the data I need is in this excerpt of the HTML (namely 89, 96 and 80):
<div class="wx-timepart-title">
Earlier Today
</div>
<div class="wx-timepart-title">Tonight</div>
<div class="wx-data-part wx-first">
<img src="http://s.imwx.com/v.20120328.084208/img/wxicon/120/29.png" height="120" width="120" alt="Partly Cloudy" class="wx-weather-icon">
</div>
<div class="wx-data-part">
<img src="http://s.imwx.com/v.20120328.084208/img/wxicon/120/30.png" height="120" width="120" alt="Partly Cloudy" class="wx-weather-icon">
</div>
<div class="wx-data-part">
<img src="http://s.imwx.com/v.20120328.084208/img/wxicon/120/29.png" height="120" width="120" alt="Partly Cloudy" class="wx-weather-icon">
</div>
<div class="wx-data-part wx-first">
<div class="wx-temperature"><span itemprop="temperature-fahrenheit">89</span><span class="wx-degrees">°<span class="wx-unit">F</span></span></div>
<div class="wx-temperature-label">FEELS LIKE
<span itemprop="feels-like-temperature-fahrenheit">94</span>°</div>
</div>
<div class="wx-data-part">
<div class="wx-temperature">96<span class="wx-degrees">°</span></div>
<div class="wx-temperature-label">HIGH AT 4:45 PM</div>
</div>
<div class="wx-data-part">
<div class="wx-temperature">80<span class="wx-degrees">°</span></div>
<div class="wx-temperature-label">LOW</div>
</div>
The solution I came up with:
import urllib.request
import re

# open the webpage and read the html code into a string
base = urllib.request.urlopen('http://www.weather.com/weather/today/Washington+DC+USDC0001:1:US')
f = base.readlines()
f = str(f)

# temperature "Right Now"
match1 = re.search(r'<div class="wx-temperature"><span itemprop="temperature-fahrenheit">\w\w',f)
if match1:
    result1 = match1.group()
    right_now = result1[68:]
    print(right_now)

# temperature "Earlier Today"
match2 = re.search(r'<div class="wx-temperature">\w\w',f)
if match2:
    result2 = match2.group()
    ealier_today = result2[28:]
    print(ealier_today)

# temperature "Tonight"
match3 = re.search(r'<div class="wx-temperature">\w\w',f)
if match3:
    result3 = match3.group()
    tonight = result3[28:]
    print(tonight)
The three print statements are just for testing if data was grabbed correctly.
My question: the problem occurs with the third regex (match3), which displays the same temperature as match2. I figure that's because it uses the same regex pattern as the second one? So my question is: how do you search for multiple results with the same regex pattern? Or can you only grab the first occurrence of a pattern? I'm quite new to Python and only a few days into regular expressions. I'd appreciate any general pointers about my solution, or about my general line of thinking on this project. Thank you!
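To make the question concrete: as far as I understand, `re.search` always returns the first match in the string, while `re.findall` returns every non-overlapping match. Below is a minimal sketch of that difference. The `sample_html` string is a hypothetical, trimmed stand-in for the page source (degree sign written as `&deg;`), not the real download.

```python
import re

# Hypothetical stand-in for the page source, trimmed to the three temperature blocks.
sample_html = (
    '<div class="wx-temperature"><span itemprop="temperature-fahrenheit">89</span></div>'
    '<div class="wx-temperature">96<span class="wx-degrees">&deg;</span></div>'
    '<div class="wx-temperature">80<span class="wx-degrees">&deg;</span></div>'
)

# The capture group (\d+) keeps only the digits, so no slicing by index is needed.
pattern = r'<div class="wx-temperature">(\d+)'

# re.search returns only the first match, however many times it is called.
first = re.search(pattern, sample_html).group(1)

# re.findall returns all matches, in document order.
temperatures = re.findall(pattern, sample_html)
earlier_today, tonight = temperatures

print(first)
print(earlier_today, tonight)
```

The "Right Now" block does not match this pattern because its number sits inside a `<span>`, so the list holds only the "Earlier Today" and "Tonight" values.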