How do you parse multiple lines in HTML using regex in Python? I have managed to match patterns that sit on a single line using the code below.

import re
import urllib

# newschoollist is a list of org code strings, defined elsewhere.
# Compile once, outside the loop; this pattern only matches when the label and value share a line.
pattern = re.compile('>Phone:</td><td>(.+?)</td></tr>')

for orgcode in newschoollist:
    url = "http://profiles.doe.mass.edu/profiles/general.aspx?topNavId=1&orgcode=" + orgcode + "&orgtypecode=6&"
    htmltext = urllib.urlopen(url).read()
    value = pattern.findall(htmltext)
    print orgcode, value

However, when I try to match HTML that spans more than one line, like this...

<td>Attendance Rate</td> 
<td class='center'>  90.1</td>  

I get empty results. I believe the problem is with my syntax. I have searched for regex help and read most of the documentation, but I am looking for help with this kind of application, and I am hoping someone can point me in the right direction. Is there a (.+?)-like combination that tells the regex to continue onto the next line of HTML?

What I want findall to pick up is the 90.1 when it finds "Attendance Rate".

Thanks!

alecxe
lecorbu
[Don't parse HTML with regex!](http://stackoverflow.com/a/1732454/418066) – Biffen Feb 20 '15 at 23:10
The simple answer is to use the DOTALL flag; the correct answer is what @Biffen said: don't use regex! – MRAB Feb 20 '15 at 23:35
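
For reference, the regex route mentioned in the comment above could look roughly like the sketch below. It is written for Python 3 and runs against the two-line snippet from the question rather than the live page. The function names are made up for illustration, and the patterns assume the label and value cells stay adjacent, which a markup change could break.

```python
import re

# Sample markup copied from the question; the label and value sit on separate lines.
html = "<td>Attendance Rate</td> \n<td class='center'>  90.1</td>  \n"


def find_with_whitespace(text):
    # \s* matches any whitespace, including the newline between the two cells,
    # so no special flag is needed.
    pattern = re.compile(r"<td>Attendance Rate</td>\s*<td[^>]*>\s*(.+?)\s*</td>")
    return pattern.findall(text)


def find_with_dotall(text):
    # With re.DOTALL, "." also matches newlines, so ".*?" can skip lazily
    # from the label to the next <td>.
    pattern = re.compile(r"Attendance Rate</td>.*?<td[^>]*>\s*([\d.]+)", re.DOTALL)
    return pattern.findall(text)


print(find_with_whitespace(html))  # ['90.1']
print(find_with_dotall(html))      # ['90.1']
```

Without DOTALL, "." stops at a newline, which is why a pattern like (.+?) alone cannot reach the value on the next line.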

2 Answers

Use an HTML parser instead of a regex. Example using BeautifulSoup:

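from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://profiles.doe.mass.edu/profiles/general.aspx?topNavId=1&orgcode=00350326'

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(urlopen(url), 'html.parser')

# Treat each table cell as a label and pair it with the cell that follows it.
for label in soup.select('div#whiteboxRight table td'):
    value = label.find_next_sibling('td')
    if not value:
        # Last cell in a row: there is no value cell after it.
        continue

    print label.get_text(strip=True), value.get_text(strip=True)
    print "----"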

Prints (profile contact information):

...
----
NCES ID: 250279000331
----
Web Site: http://www.bostonpublicschools.org
----
MA School Type: Public School
----
NCES School Reconstituted: No
...
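
To pull out just the "Attendance Rate" value from the question, the same label-then-next-cell idea can be sketched without any third-party dependency. This is an illustration only: it is written for Python 3, uses the standard library's html.parser in place of BeautifulSoup, and runs against the question's snippet (wrapped in a `<tr>` here) rather than the live page. `CellCollector` and `value_after_label` are hypothetical names, and nested tables are not handled.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the stripped text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td> cell.
        if self._in_td:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self.cells.append("".join(self._parts).strip())
            self._in_td = False


def value_after_label(html, label):
    """Return the text of the cell that follows the cell equal to label, or None."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    cells = parser.cells
    for i, text in enumerate(cells[:-1]):
        if text == label:
            return cells[i + 1]
    return None


sample = """<tr>
<td>Attendance Rate</td> 
<td class='center'>  90.1</td>  
</tr>"""

print(value_after_label(sample, "Attendance Rate"))  # 90.1
```

Because the parser works on cells rather than raw text, the line break between the label and the value makes no difference.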
alecxe
I ended up using soup.get_text() and it worked great. Thanks!

lecorbu