
I am trying to parse the HTML of a web page to find the max enrollment of a class. I first checked each line of the HTML file for a substring, but that matched the wrong lines, so I am now using regular expressions. My current pattern is `\t\t\t\t\t\t\t<td class="odd">([0-9])|([0-9][0-9])|([0-9][0-9][0-9])<\/td>\r\n`, but it matches the section number as well as the max enrollment. Is there a better way to extract only the max enrollment from the page? The relevant HTML snippet is below:

<tr>
    <td class="tableHeader">Section</td>
    <td class="odd">001</td>
</tr>

<tr>
    <td class="tableHeader">Credits</td>
    <td class="even" align="left">  4.00</td>
</tr>

<tr>
    <td class="tableHeader">Title</td>
    <td class="odd">Linear Algebra</td>
</tr>

<tr>
    <td class="tableHeader">Campus</td>
    <td class="even" align="left">University City</td>
</tr>

<tr>
    <td class="tableHeader">Instructor(s)</td>
    <td class="odd">Guang  Yang</td>
</tr>
<tr>
    <td class="tableHeader">Instruction Type</td>
    <td class="even">Lecture</td>
</tr>

<tr>
    <td class="tableHeader">Max Enroll</td>
    <td class="odd">30</td>
</tr>
alecxe
heinst
  • 3
    Read this: http://stackoverflow.com/a/1732454/3001761 – jonrsharpe May 08 '14 at 17:25
  • Have you tried [`HTMLParser`](https://docs.python.org/2/library/htmlparser.html)/[`html.parser`](https://docs.python.org/3/library/html.parser.html) instead? – admdrew May 08 '14 at 17:25
  • please help me help you, by improving my answer: what do you mean by the "looking for the max enrollment"? Can you give me example of what you try to get from your html example? – zmo May 08 '14 at 17:27
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Lukas Graf May 08 '14 at 17:34
  • 2
    do not agree about the dupe, it's not asking whether it can be done with a regex, it's wrongly trying to do that. – zmo May 08 '14 at 17:36
  • @zmo which is exactly what the OP of the dupe is trying to do. – Lukas Graf May 08 '14 at 17:38
  • 1
    This is not a duplicate. That OP is trying to actually match the tag name, class name, etc. I am just trying to extract the contents in such a way where I don't get the section number AND max enroll number. I just need help with getting only the Max Enroll number. – heinst May 08 '14 at 17:41
  • 1
    Well then instead of sitting there insulting the way I approached this problem, maybe it would be more productive to point me in the right direction, wouldn't it? – heinst May 08 '14 at 17:49
  • @LukasGraf This really isn't a duplicate anymore, as it's a specific question with correct, *specific* answers. – admdrew May 08 '14 at 17:58
  • @admdrew so what's the specific question? Can you point me to the question mark in the OP's post please? – Lukas Graf May 08 '14 at 18:00
  • What I don't get is that you're saying as a comment *I am just trying to extract the contents in such a way where I don't get the section number AND max enroll number. I just need help with getting only the Max Enroll number.*, but you're accepting the one answer that gives you the section number AND the max enroll number. I don't get your logic. – zmo May 08 '14 at 18:02
  • @zmo I accepted that answer before you added a few more things to yours, I went through all the answers again. – heinst May 08 '14 at 18:03
  • @LukasGraf `Is there another way to go about what I am trying to extract (the max enrollment of a class) from the webpage?` – admdrew May 08 '14 at 18:03
  • A decent answer, instead of spoon-feeding a correct solution to the OP, would include an explanation **why** HTML can't be parsed with regular expressions. – Lukas Graf May 08 '14 at 18:03
  • 1
    @LukasGraf then it would be most helpful if you could explain why you can't instead of being demeaning in the comment section – heinst May 08 '14 at 18:06
  • 2
    which is why I'm giving a link in my all-caps disclaimer. [I could also write it using toilet](http://patorjk.com/software/taag/#p=display&v=1&f=Big&t=DO%20NOT%20PARSE%20HTML)? – zmo May 08 '14 at 18:07
  • @heinst I already did, and the answer I linked does too: Because HTML isn't [regular](http://en.wikipedia.org/wiki/Chomsky_hierarchy). – Lukas Graf May 08 '14 at 18:08

3 Answers


DO NOT PARSE HTML USING REGEXP.

Use the right tool for the right job.

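Here is an analogy for why it's wrong: it's like asking a five-year-old to understand Hamlet. The child does not yet have the vocabulary and grammar needed for Shakespeare, and will only acquire them once able to process more abstract concepts. In the same way, a regular expression has no notion of the nested structure of an HTML document, so it cannot tell which label a cell belongs to.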
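
To make the problem concrete, here is a small sketch of what goes wrong. The pattern is a simplified form of the one in the question (it assumes one to three digits was the intent and drops the tab and newline anchors), and the HTML is a shortened copy of the question's snippet. Nothing in the pattern ties a cell to its label, so the section number and the max enrollment look identical to it:

```python
import re

# Shortened copy of the question's HTML: two rows whose value cells
# both have class "odd" and both contain only digits.
html = """
<tr>
    <td class="tableHeader">Section</td>
    <td class="odd">001</td>
</tr>
<tr>
    <td class="tableHeader">Max Enroll</td>
    <td class="odd">30</td>
</tr>
"""

# Simplified form of the question's pattern (assumption: the three
# alternatives were meant to express "one to three digits").
pattern = re.compile(r'<td class="odd">([0-9]{1,3})</td>')

matches = pattern.findall(html)
print(matches)  # ['001', '30']
```

Both cells match because the pattern only sees characters, not the row each cell sits in.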

Use either lxml or BeautifulSoup to parse the HTML instead.

For example, to get the text of every `odd` cell and every `even` cell:

>>> from lxml import etree
>>> tree = etree.HTML(your_html_text)
>>> odds = tree.xpath('//td[@class="odd"]/text()')
>>> evens = tree.xpath('//td[@class="even"]/text()')
>>> odds
['001', 'Linear Algebra', 'Guang  Yang', '30']
>>> evens
['  4.00', 'University City', 'Lecture']

Edit:

> I am just trying to extract the contents in such a way where I don't get the section number AND max enroll number. I just need help with getting only the Max Enroll number.

OK, now I understand what you want. Here is a solution using lxml:

>>> for elt in tree.xpath('//tr'):
...     if elt.xpath('td[@class="tableHeader"]')[0].text == "Max Enroll":
...         elt.xpath('td[@class="odd"]|td[@class="even"]')[0].text
... 
'30'

That gives you only the max enrollment number.

Using BeautifulSoup it's a bit easier:

>>> from bs4 import BeautifulSoup
>>> bs = BeautifulSoup(your_html_text, 'html.parser')
>>> for t in bs.find_all('td', attrs={'class': 'tableHeader'}):
...     if t.text == "Max Enroll":
...         print(t.find_next('td').text)
...
30
zmo
  • 1
    `soup.find('td', text="Max Enroll").find_next_sibling('td').text` would be easier. – alecxe May 08 '14 at 18:28
  • indeed, though I'm giving the more general approach here, so the OP can adapt to his dataset. – zmo May 08 '14 at 18:29

Use a tool that specializes in parsing HTML, such as BeautifulSoup:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

For example, here's how you can get what you want:

from bs4 import BeautifulSoup

data = """your html here"""

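# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, 'html.parser')
# Find the label cell by its text, then step to the value cell beside it.
print(soup.find('td', text="Max Enroll").find_next_sibling('td').text)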

Prints:

30
alecxe
  • If I choose this method, I will not be able to give this script to friends very easily for them to use because it will use a library that they (most likely) will not have installed on their computer initially, correct? – heinst May 08 '14 at 17:45
  • @heinst well, `BeautifulSoup` is a third-party library that can be easily installed. Just include [`requirements.txt`](https://pip.pypa.io/en/1.1/requirements.html) file with script dependencies and give it to your friends. – alecxe May 08 '14 at 17:47
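
If installing a third-party library is a concern, the standard library's `html.parser` (mentioned in the comments on the question) can also do this without any extra install. Below is a minimal sketch, not a definitive implementation: the `FieldParser` class and its attribute names are hypothetical, the HTML is a shortened copy of the question's snippet, and it assumes every row pairs one `tableHeader` cell with one value cell.

```python
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Hypothetical helper: collect label/value pairs from <td> cells."""

    def __init__(self):
        super().__init__()
        self.fields = {}          # label text -> value text
        self._in_td = False       # are we currently inside a <td>?
        self._is_header = False   # is that <td> a tableHeader cell?
        self._label = None        # last label seen, waiting for its value
        self._buf = []            # text chunks of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self._is_header = dict(attrs).get("class") == "tableHeader"
            self._buf = []

    def handle_data(self, data):
        if self._in_td:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != "td" or not self._in_td:
            return
        text = "".join(self._buf).strip()
        if self._is_header:
            self._label = text
        elif self._label is not None:
            # First non-header cell after a label is taken as its value.
            self.fields[self._label] = text
            self._label = None
        self._in_td = False


html = """
<tr>
    <td class="tableHeader">Section</td>
    <td class="odd">001</td>
</tr>
<tr>
    <td class="tableHeader">Max Enroll</td>
    <td class="odd">30</td>
</tr>
"""

parser = FieldParser()
parser.feed(html)
parser.close()
print(parser.fields["Max Enroll"])  # 30
```

This is more code than the BeautifulSoup version and less forgiving of messy markup, so it is mainly worth it when avoiding a dependency matters.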

An alternative to zmo's answer, using BeautifulSoup:

from bs4 import BeautifulSoup

data = """
<snipped html>
"""

soup = BeautifulSoup(data, 'html.parser')

# Walk the label cells and print the value cell next to "Max Enroll".
for tableHeaders in soup.find_all('td', class_="tableHeader"):
    if tableHeaders.get_text() == "Max Enroll":
        print(tableHeaders.find_next_siblings('td', class_="odd")[0].get_text())

Output:

30
admdrew