1

I wounder if anyone could tell me what I'm doing wrong with this code. I have a HTML and want to read out the Founded in year - which in this case is 1949. How do I that? Please note the space and blank line.

Below is the code

<h4>  Founded in

</h4></td><td><h5> <!--10305--> 1949</h5></td></tr> <tr>

And this is the code that I'm using. And nothing is being printed.

myf = 'THE HTML HERE'
myf.replace("<!--10305-->", "")
year = re.findall(r"<h4>  Founded in.*? (.*?)</h5></td></tr> <tr>", myf, re.DOTALL)
print year

Any help would be appreciated.

Helen Neely
  • 4,666
  • 8
  • 40
  • 64
  • 2
    "I wounder if anyone could tell me what I'm doing wrong with this code." Maybe it's that you're using Regex to parse HTML... – Veedrac Sep 25 '13 at 15:13
  • 1
    Use [lxml](http://lxml.de/parsing.html#parsing-html) probably with XPath or CSS Selector. – Cristian Ciupitu Sep 25 '13 at 15:15
  • Did posting that one link to the HTML regex Q&A go out of style? Because if not... – austin Sep 25 '13 at 15:16
  • You are better off using a well tested and stable HTML parser like lxml or BeautifulSoup to glean out the required information – Prahalad Deshpande Sep 25 '13 at 15:44
  • @austin - That is one of my favorite answers, but I have come to dread seeing it automatically invoked with every regex question. I think some well-defined simple cases are fine for regex, and a lot simpler than tree parsing... – beroe Sep 25 '13 at 18:22

2 Answers2

2

Using lxml with xpath:

>>> import lxml.html
>>>
>>> root = lxml.html.fromstring('''
... <tr>
... <td>
... <h4>  Founded in
...
... </h4></td><td><h5> <!--10305--> 1949</h5></td></tr>
... ''')
>>> root.xpath('//h4[contains(text(), "Founded in")]/parent::*/following-sibling::*')[0].text_content().strip()
'1949'
falsetru
  • 357,413
  • 63
  • 732
  • 636
0

Strings are immutable. This:

myf.replace("<!--10305-->", "")

returns a value but does not change myf. You want:

myf = myf.replace("<!--10305-->", "")

Further, this code prints something anyway:

import re

myf = """\
<h4>  Founded in

</h4></td><td><h5> <!--10305--> 1949</h5></td></tr> <tr>"""

myf.replace("<!--10305-->", "")

year = re.findall(r"<h4>  Founded in.*? (.*?)</h5></td></tr> <tr>", myf, re.DOTALL)

year
#>>> ['<!--10305--> 1949']

so the real problem is elsewhere.

Veedrac
  • 58,273
  • 15
  • 112
  • 169