
Edit: With the kind help of the answers below, I came to realize that parsing HTML with regex is generally a bad idea. For anyone who comes across this post with the same question, two similar questions carry far more debate and explanation: "Using regular expressions to parse HTML: why not?" and "RegEx match open tags except XHTML self-contained tags".

Specs: Python 3.3.1

What I was trying to do: I was writing a web page extractor to grab weather data from a website. For my project the page has 3 meaningful sections: the temperature "Right Now", "Earlier Today" and "Tonight". I intend to grab only these 3 numbers and leave out all other text. In the code below I use the specific HTML elements preceding each temperature as a pattern to locate the number itself.

All the data I need (namely 89, 96 and 80) is in this excerpt of the page's HTML:

<div class="wx-timepart-title">
Earlier Today
</div>
<div class="wx-timepart-title">Tonight</div>
<div class="wx-data-part wx-first">
<img src="http://s.imwx.com/v.20120328.084208/img/wxicon/120/29.png" height="120" width="120" alt="Partly Cloudy" class="wx-weather-icon">
</div>
<div class="wx-data-part">
<img src="http://s.imwx.com/v.20120328.084208/img/wxicon/120/30.png" height="120" width="120" alt="Partly Cloudy" class="wx-weather-icon">
</div>
<div class="wx-data-part">
<img src="http://s.imwx.com/v.20120328.084208/img/wxicon/120/29.png" height="120" width="120" alt="Partly Cloudy" class="wx-weather-icon">
</div>
<div class="wx-data-part wx-first">
<div class="wx-temperature"><span itemprop="temperature-fahrenheit">89</span><span class="wx-degrees">&deg;<span class="wx-unit">F</span></span></div>
<div class="wx-temperature-label">FEELS LIKE
<span itemprop="feels-like-temperature-fahrenheit">94</span>&deg;</div>
</div>
<div class="wx-data-part">
<div class="wx-temperature">96<span class="wx-degrees">&deg;</span></div>
<div class="wx-temperature-label">HIGH AT 4:45 PM</div>
</div>
<div class="wx-data-part">
<div class="wx-temperature">80<span class="wx-degrees">&deg;</span></div>
<div class="wx-temperature-label">LOW</div>
</div>  

The solution I came up with:

import urllib.request
import re

# open the webpage and read the html code into a single string
# (decode the bytes instead of calling str() on a list of byte lines)
base = urllib.request.urlopen('http://www.weather.com/weather/today/Washington+DC+USDC0001:1:US')
f = base.read().decode('utf-8', errors='replace')


# temperature "Right Now" 
match1 = re.search(r'<div class="wx-temperature"><span itemprop="temperature-fahrenheit">\w\w',f)

if match1:
    result1 = match1.group()
    right_now = result1[68:]
    print(right_now)


# temperature "Earlier Today"
match2 = re.search(r'<div class="wx-temperature">\w\w',f)

if match2:
    result2 = match2.group()
    ealier_today = result2[28:]
    print(ealier_today)


# temperature "Tonight"
match3 = re.search(r'<div class="wx-temperature">\w\w',f)

if match3:
    result3 = match3.group()
    tonight = result3[28:]
    print(tonight)

The three print statements are just for testing if data was grabbed correctly.

My question: the problem occurs at the third regex (match3), which displays the same temperature as match2. I figure that's because it uses the same pattern as the second. So how do you search for multiple results with the same regex pattern? Or can you only grab the first occurrence of a pattern? I'm quite new to Python and only a few days into regular expressions. I'd appreciate any general pointers about my solution, or about my general line of thinking on this project. Thank you!

hakuna121

1 Answer


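Perhaps I misunderstand your question, but are you merely looking for findall? Unlike re.search, which returns only the first match, re.findall returns a list of every non-overlapping match: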

match3 = re.findall(r'<div class="wx-temperature">\w\w',f)
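To make the difference concrete, here is a minimal sketch, not your exact page: the html string is a trimmed-down stand-in I made up from your excerpt, and the pattern is my own variation that optionally skips a leading span and captures the digits in a group, so no string slicing is needed afterwards.

```python
import re

# Trimmed-down stand-in for the page source (an assumption for illustration).
html = (
    '<div class="wx-temperature"><span itemprop="temperature-fahrenheit">89</span></div>'
    '<div class="wx-temperature">96<span class="wx-degrees">&deg;</span></div>'
    '<div class="wx-temperature">80<span class="wx-degrees">&deg;</span></div>'
)

# Optionally skip one opening <span ...> tag, then capture the number itself.
pattern = r'<div class="wx-temperature">(?:<span[^>]*>)?(-?\d+)'

first = re.search(pattern, html)   # stops at the first match
every = re.findall(pattern, html)  # list of the captured group for every match

print(first.group(1))
print(every)
```

With a single capturing group, findall returns just the captured text for each match, so you can unpack the three values directly, e.g. right_now, earlier_today, tonight = every.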

Also, you might find it easier to use BeautifulSoup or something along those lines. Parsing HTML with regexes is hellish. There's also no need to reinvent the wheel, since Python has hundreds of well-built modules that have already done a lot of the work for you. You could do the following after installing bs4:

>>> from bs4 import BeautifulSoup
>>> html = '''<div class="wx-timepart-title">
Earlier Today
</div>
<div class="wx-timepart-title">Tonight</div>
<div class="wx-data-part wx-first">
<img src="http://s.imwx.com/v.20120328.084208/img/wxicon/120/29.png" height="120" width="120" alt="Partly Cloudy" class="wx-weather-icon">
</div>
<div class="wx-data-part">
<img src="http://s.imwx.com/v.20120328.084208/img/wxicon/120/30.png" height="120" width="120" alt="Partly Cloudy" class="wx-weather-icon">
</div>
<div class="wx-data-part">
<img src="http://s.imwx.com/v.20120328.084208/img/wxicon/120/29.png" height="120" width="120" alt="Partly Cloudy" class="wx-weather-icon">
</div>
<div class="wx-data-part wx-first">
<div class="wx-temperature"><span itemprop="temperature-fahrenheit">89</span><span class="wx-degrees">&deg;<span class="wx-unit">F</span></span></div>
<div class="wx-temperature-label">FEELS LIKE
<span itemprop="feels-like-temperature-fahrenheit">94</span>&deg;</div>
</div>
<div class="wx-data-part">
<div class="wx-temperature">96<span class="wx-degrees">&deg;</span></div>
<div class="wx-temperature-label">HIGH AT 4:45 PM</div>
</div>
<div class="wx-data-part">
<div class="wx-temperature">80<span class="wx-degrees">&deg;</span></div>
<div class="wx-temperature-label">LOW</div>
</div>  '''
>>> soup = BeautifulSoup(html, 'html.parser')   # name the parser explicitly to avoid a warning
>>> for temp in soup.find_all(class_="wx-temperature"):
    print(temp.text)       # or add these to a list or make a list comprehension


89°F
96°
80°

If you merely want the digits (and possibly a leading minus sign), you can do this:

>>> import re
>>> for temp in soup.find_all(class_="wx-temperature"):
    print(re.match(r'-?\d+', temp.text).group())


89
96
80

This approach gives you some flexibility in case the temperature ever drops to one digit or rises to three. I added the -?, which matches 0 or 1 occurrences of the character -, in case you run across negative temperatures.
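As a quick check of how that pattern behaves on different widths and signs, here is a small sketch. The helper name leading_int and the sample strings are made up for illustration; the guard against a failed match is my addition, since calling .group() on None would raise an AttributeError.

```python
import re

def leading_int(text):
    """Return the integer at the start of text, or None if there is none."""
    m = re.match(r'-?\d+', text)
    return int(m.group()) if m else None

# Hypothetical samples: two digits, one digit, three digits, negative, no number.
samples = ['89°F', '96°', '5°', '102°', '-7°', 'N/A']
print([leading_int(s) for s in samples])
```

Returning None for text without a number lets the caller decide how to handle a missing reading instead of crashing mid-loop.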

Justin O Barber