
I am using line.rfind() to find a certain line in an HTML page, and then splitting that line to pull out individual numbers. For example:

position1 = line.rfind('Wed')

This finds the following line of HTML:

 <strong class="temp">79<span>&deg;</span></strong><span class="low"><span>Lo</span> 56<span>&deg;</span></span>

First I want to pull out the '79', which is done with the following code:

if position1 != -1:  # rfind() returns -1 when nothing is found; 0 is a valid match
    self.high0 = lines[line_number + 4].split('<span>')[0].split('">')[-1]

This works perfectly. The problem is extracting the '56' from that line of HTML. I can't split it between '<span>' and '</span>', since the first '<span>' in the line comes right after the '79'. Is there a way to tell the script to look for a later occurrence of '<span>'?

Thanks for your help!

hunter21188

2 Answers


Concerns about parsing HTML with regex aside, I've found that regex tends to be fairly useful for grabbing information from limited, machine-generated HTML.

You can pull out both values with a regex like this:

import re
matches = re.findall(r'<strong class="temp">(\d+).*?<span>Lo</span> (\d+)', lines[line_number+4])
if matches:
    high, low = matches[0]

Treat this as quick-and-dirty: if you rely on it for anything important, you may want to use a real parser like BeautifulSoup.
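
For reference, here is a small self-contained sketch of the same regex run against the sample line from the question. The names `TEMP_RE`, `extract_high_low` and `sample` are made up for illustration; only the pattern comes from the code above.

```python
import re

# Pattern from the answer above: group 1 is the high, group 2 is the low.
TEMP_RE = re.compile(r'<strong class="temp">(\d+).*?<span>Lo</span> (\d+)')

def extract_high_low(line):
    """Return (high, low) as strings, or None if the line does not match."""
    match = TEMP_RE.search(line)
    return match.groups() if match else None

# Sample line copied from the question.
sample = ('<strong class="temp">79<span>&deg;</span></strong>'
          '<span class="low"><span>Lo</span> 56<span>&deg;</span></span>')

print(extract_high_low(sample))
```

Returning None on a miss keeps the caller's `if` check simple, and the values stay strings, so convert with int() if you need numbers.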

nneonneo
  • Awesome. Thank you. This is just for my own purposes, nothing important. Though I may check out BeautifulSoup anyway. Thanks again. – hunter21188 Sep 11 '13 at 04:13

import re

html = """
 <strong class="temp">79<span>&deg;</span></strong><span class="low"><span>Lo</span> 56<span>&deg;</span></span>
"""

# \d+ needs no flags; re.X, re.M and re.S have no effect on this pattern
numbers = re.findall(r"\d+", html)
print(numbers)

--output:--
['79', '56']
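
Note that `\d+` picks up every run of digits, so this relies on the line containing no other numbers. To answer the literal question with plain string methods: `str.find` accepts a start index, so a second search can begin just past the first hit, and splitting on a more specific marker such as `<span>Lo</span>` also works. A rough sketch, with variable names chosen only for illustration:

```python
line = ('<strong class="temp">79<span>&deg;</span></strong>'
        '<span class="low"><span>Lo</span> 56<span>&deg;</span></span>')

# Second occurrence of '<span>': search again, starting after the first hit.
first = line.find('<span>')
second = line.find('<span>', first + 1)
# In this line, `second` points at the start of '<span>Lo</span>'.

# Simpler here: split on the marker that directly precedes the low value.
low = line.split('<span>Lo</span>')[1].split('<span>')[0].strip()
print(low)
```

Be aware that `<span class="low">` does not count as an occurrence of `'<span>'`, which is one reason the marker-based split is less fragile than counting occurrences.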

With BeautifulSoup:

from bs4 import BeautifulSoup

html = """
<strong class="temp">
    79
    <span>&deg;</span>
</strong>
<span class="low">
   <span>Lo</span> 
   56
   <span>&deg;</span>
</span>
"""

# Name the parser explicitly; otherwise bs4 picks one for you and warns
soup = BeautifulSoup(html, "html.parser")
low_span = soup.find('span', class_="low")

for string in low_span.stripped_strings:
    print(string)

--output:--
Lo
56
°
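
Since `stripped_strings` also yields the label and the degree sign, one way to keep just the number is to filter for digit-only strings. The list below is hard-coded from the output above so the snippet runs without bs4 installed; in real use you would iterate over `low_span.stripped_strings` instead.

```python
# Strings as printed above (a stand-in for low_span.stripped_strings).
strings = ["Lo", "56", "\u00b0"]

# Take the first string made up only of digits, or None if there is none.
low = next((s for s in strings if s.isdigit()), None)
print(low)
```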
7stud