Regex - Read date from HTML

Question

I wonder if anyone could tell me what I'm doing wrong with this code. I have some HTML and want to read out the "Founded in" year, which in this case is 1949. How do I do that? Please note the spaces and the blank line.

Below is the HTML:

<h4>  Founded in

</h4></td><td><h5> <!--10305--> 1949</h5></td></tr> <tr>

And this is the code that I'm using, but nothing is being printed.

myf = 'THE HTML HERE'
myf.replace("<!--10305-->", "")
year = re.findall(r"<h4>  Founded in.*? (.*?)</h5></td></tr> <tr>", myf, re.DOTALL)
print year

Any help would be appreciated.

"I wounder if anyone could tell me what I'm doing wrong with this code." Maybe it's that you're using Regex to parse HTML... — Veedrac, Sep 25 '13 at 15:13
Use [lxml](http://lxml.de/parsing.html#parsing-html) probably with XPath or CSS Selector. — Cristian Ciupitu, Sep 25 '13 at 15:15
Did posting that one link to the HTML regex Q&A go out of style? Because if not... — austin, Sep 25 '13 at 15:16
You are better off using a well tested and stable HTML parser like lxml or BeautifulSoup to glean out the required information — Prahalad Deshpande, Sep 25 '13 at 15:44
@austin - That is one of my favorite answers, but I have come to dread seeing it automatically invoked with every regex question. I think some well-defined simple cases are fine for regex, and a lot simpler than tree parsing... — beroe, Sep 25 '13 at 18:22

score 2 · Answer 1 · answered Sep 25 '13 at 15:37

Using lxml with xpath:

>>> import lxml.html
>>>
>>> root = lxml.html.fromstring('''
... <tr>
... <td>
... <h4>  Founded in
...
... </h4></td><td><h5> <!--10305--> 1949</h5></td></tr>
... ''')
>>> root.xpath('//h4[contains(text(), "Founded in")]/parent::*/following-sibling::*')[0].text_content().strip()
'1949'

score 0 · Accepted Answer · answered Sep 25 '13 at 15:17

Strings are immutable. This:

myf.replace("<!--10305-->", "")

returns a value but does not change myf. You want:

myf = myf.replace("<!--10305-->", "")
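A minimal sketch of the difference, using a shortened stand-in for the question's HTML (the names `html`, `unchanged` and `cleaned` are made up for illustration):

```python
html = "<h5> <!--10305--> 1949</h5>"

# str.replace returns a new string; calling it without keeping the
# result leaves the original untouched.
html.replace("<!--10305-->", "")
unchanged = html

# Keeping the return value is what actually removes the comment.
cleaned = html.replace("<!--10305-->", "")

print(repr(unchanged))
print(repr(cleaned))
```

The first print still shows the comment; the second does not.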

Further, this code prints something anyway:

import re

myf = """\
<h4>  Founded in

</h4></td><td><h5> <!--10305--> 1949</h5></td></tr> <tr>"""

myf.replace("<!--10305-->", "")

year = re.findall(r"<h4>  Founded in.*? (.*?)</h5></td></tr> <tr>", myf, re.DOTALL)

year
#>>> ['<!--10305--> 1949']

so the real problem is elsewhere.
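As for why the match above still includes the comment: the lazy `.*? ` stops at the first space after "Founded in", which is the one right after `<h5>`, so the group captures everything from the comment onward. If only the digits are wanted, one option is a pattern that skips an optional comment and captures four digits. This is only a sketch: it assumes the year is always four digits and the markup looks like the snippet in the question, and the name `FOUNDED_YEAR` is hypothetical. It uses Python 3 `print()`.

```python
import re

snippet = """\
<h4>  Founded in

</h4></td><td><h5> <!--10305--> 1949</h5></td></tr> <tr>"""

# \s* absorbs the spaces and the blank line; (?:<!--.*?-->\s*)? skips
# an HTML comment if one is present; (\d{4}) captures the year itself.
FOUNDED_YEAR = re.compile(
    r"<h4>\s*Founded in\s*</h4>\s*</td>\s*<td>\s*<h5>\s*"
    r"(?:<!--.*?-->\s*)?(\d{4})\s*</h5>",
    re.DOTALL,
)

match = FOUNDED_YEAR.search(snippet)
year = match.group(1) if match else None
print(year)
```

Because the comment is handled inside the pattern, the separate `replace` call is not needed with this approach. For anything beyond a small, well-defined snippet like this one, the HTML parsers suggested in the comments are likely the safer route.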

Thanks, not sure what was wrong with my code, copy/pasted yours and it worked. — Helen Neely, Sep 25 '13 at 17:35
