Python - regex lookup for multiple lines of HTML

Question

How do you parse multiple lines of HTML using regex in Python? I have managed to match patterns that sit on a single line using the code below.

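import re
import urllib

# newschoollist (the org codes) and valuetag are assumed to be defined earlier in the script.
# Compile the pattern once, outside the loop.
pattern = re.compile('>Phone:</td><td>(.+?)</td></tr>')

for orgcode in newschoollist:
    url = "http://profiles.doe.mass.edu/profiles/general.aspx?topNavId=1&orgcode=" + orgcode + "&orgtypecode=6&"
    htmltext = urllib.urlopen(url).read()
    # "." does not match a newline by default, so this only finds single-line matches.
    value = pattern.findall(htmltext)
    print orgcode, valuetag, value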

However, when I try to match HTML that spans more than one line, like this:

<td>Attendance Rate</td> 
<td class='center'>  90.1</td>

I get no matches. I believe the problem is with my syntax. I have Googled regex and read most of the documentation, but I am looking for help with this kind of application, and I am hoping someone can point me in the right direction. Is there a (.+?)-like combination that tells the regex to jump down to the next line of HTML?

What I want findall to pick up is the 90.1 when it finds "Attendance Rate".

Thanks!

[Don't parse HTML with regex!](http://stackoverflow.com/a/1732454/418066) — Biffen, Feb 20 '15 at 23:10
The simple answer is to use the DOTALL flag; the correct answer is what @Biffen said: don't use regex! — MRAB, Feb 20 '15 at 23:35
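
For reference, here is a minimal sketch of the regex route mentioned in the comment above. It is written for Python 3 and runs against the two-line fragment from the question rather than the live page; the variable names are illustrative only. Two variants are shown: one lets `\s*` absorb the line break, the other uses `re.DOTALL` so that `.` also matches newlines. Either one is likely to break if the page's markup changes, which is why a parser is usually the safer choice.

```python
import re

# The two-line fragment from the question (note the newline between the cells).
html = "<td>Attendance Rate</td> \n<td class='center'>  90.1</td>"

# Variant 1: \s* matches any whitespace, including the newline between the cells.
ws_pattern = re.compile(r"<td>Attendance Rate</td>\s*<td[^>]*>\s*([\d.]+)\s*</td>")

# Variant 2: with re.DOTALL, "." also matches newlines, so ".*?" can cross lines.
dotall_pattern = re.compile(
    r"Attendance Rate</td>.*?<td[^>]*>\s*(.+?)\s*</td>", re.DOTALL
)

ws_matches = ws_pattern.findall(html)
dotall_matches = dotall_pattern.findall(html)
print(ws_matches, dotall_matches)
```

Without the flag, the `.*?` in the second pattern stops at the end of the line, which would explain why the single-line pattern in the question finds nothing here.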

score 0 · Accepted Answer · edited May 23 '17 at 12:12

Use an HTML Parser. Example using BeautifulSoup:

from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://profiles.doe.mass.edu/profiles/general.aspx?topNavId=1&orgcode=00350326'

soup = BeautifulSoup(urlopen(url))
# Walk every table cell in the right-hand box and pair it with the cell that follows it.
for label in soup.select('div#whiteboxRight table td'):
    value = label.find_next_sibling('td')
    if not value:
        # No following <td> in this row, so this cell is a value, not a label.
        continue

    print label.get_text(strip=True), value.get_text(strip=True)
    print "----"

Prints (profile contact information):

...
----
NCES ID: 250279000331
----
Web Site: http://www.bostonpublicschools.org
----
MA School Type: Public School
----
NCES School Reconstituted: No
...
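
The same idea could be pointed at the "Attendance Rate" cell from the question. What follows is only a sketch under some assumptions: it is written for Python 3 with bs4 installed, it parses the fragment from the question (wrapped in a table row for illustration) instead of fetching the page, it names the parser explicitly, and it assumes the label text matches exactly and the value sits in the next `<td>` sibling. The helper name `cell_after` is made up for this example.

```python
from bs4 import BeautifulSoup

# Fragment from the question, wrapped in <table><tr> for illustration.
html = (
    "<table><tr><td>Attendance Rate</td> \n"
    "<td class='center'>  90.1</td></tr></table>"
)


def cell_after(html, label_text):
    """Return the stripped text of the <td> that follows the label cell, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the <td> whose only content is exactly the label text.
    label = soup.find("td", string=label_text)
    if label is None:
        return None
    # Skip the whitespace text node and take the next <td> in the same row.
    value = label.find_next_sibling("td")
    if value is None:
        return None
    return value.get_text(strip=True)


rate = cell_after(html, "Attendance Rate")
print(rate)
```

Because the lookup follows the document structure, the line break between the two cells does not matter.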

score 0 · Answer 2 · answered Feb 22 '15 at 21:02

I ended up using soup.get_text() and it worked great. Thanks!

lecorbu
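
The answer above does not say how the value was pulled out of the text, so the following is only one possible reading: a Python 3 sketch (bs4 assumed installed) that strips the tags with `get_text()` and then runs a short regex over the plain text. The sample HTML is the fragment from the question, and the pattern is an assumption, not the poster's code.

```python
import re
from bs4 import BeautifulSoup

# The two-line fragment from the question.
html = "<td>Attendance Rate</td> \n<td class='center'>  90.1</td>"

# get_text() drops the tags and keeps only the text nodes, newline included.
text = BeautifulSoup(html, "html.parser").get_text()

# With the tags gone, \s* only has to skip the whitespace between label and value.
match = re.search(r"Attendance Rate\s*([\d.]+)", text)
rate = match.group(1) if match else None
print(rate)
```

This keeps the regex short because it no longer has to describe the markup, only the label and the number after it.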
