Python Regular Expression Webpage

Question

I need help writing a regular expression for a webpage to extract some data. The webpage is: http://www.city-data.com/city/Addison-Texas.html

I want to return "Dallas" from this bit of html code:

<a href="/county/Dallas_County-TX.html">Dallas County</a>
</p>
<b>Population in 2012:</b>

This is the regular expression I have written so far, but it does not seem to work. Any idea what I did wrong?

(">(.)/sCounty</a>\n</p>\n<b>Population in 2012:</b>")

I still receive the same error: Traceback (most recent call last): File "", line 1, in IndexError: list index out of range — stochasticcrap, Feb 02 '14 at 09:49
Take one of the solutions in [this question](http://stackoverflow.com/q/11709079/18771). You don't want to use regular expressions on HTML, *because regular expressions are unable to parse HTML*. — Tomalak, Feb 02 '14 at 09:59
The secret is to never use regex to parse html. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Slater Victoroff, Feb 02 '14 at 10:08

nitish712 · Answer 1 · 2014-02-02T10:05:43.467

1

Another way to solve it, rather than using a regex, is to use the split function.

s.split('</a>')[0].split('>')[1].split(' ')[0]

should return the answer you intended.
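
To see why that chain works, here is a step-by-step sketch. It assumes `s` holds only the snippet from the question, not the whole page; the intermediate variable names are made up for illustration.

```python
# Assumption: s is exactly the HTML fragment quoted in the question.
s = '<a href="/county/Dallas_County-TX.html">Dallas County</a>\n</p>\n<b>Population in 2012:</b>'

# Everything before the closing </a> tag.
before_close = s.split('</a>')[0]

# The text after the first '>' is the link text.
link_text = before_close.split('>')[1]

# The first word of the link text is the county name.
county = link_text.split(' ')[0]

print(county)
```

If `s` is the full page, the first `</a>` will most likely belong to some other link, so you would have to cut the string down to this fragment first.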

However, this approach becomes tedious for more complex HTML. You can use the HTMLParser module instead.
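
A minimal sketch of the parser approach follows. It is written for Python 3, where the module is `html.parser` (in Python 2 it was `HTMLParser`). The class name `CountyLinkParser` and the rule "any link whose href starts with /county/" are my own assumptions based on the snippet in the question, so adjust them to the real page.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class CountyLinkParser(HTMLParser):
    """Collects the text of every <a> whose href starts with /county/ (assumed rule)."""

    def __init__(self):
        super().__init__()
        self._in_county_link = False
        self.county_names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        href = dict(attrs).get('href') or ''
        if tag == 'a' and href.startswith('/county/'):
            self._in_county_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_county_link = False

    def handle_data(self, data):
        # Only keep text that appears inside a matching link.
        if self._in_county_link:
            self.county_names.append(data.strip())


# Assumption: html is the fragment from the question; the full page source works the same way.
html = '<a href="/county/Dallas_County-TX.html">Dallas County</a>\n</p>\n<b>Population in 2012:</b>'

parser = CountyLinkParser()
parser.feed(html)
parser.close()

# Drop the trailing " County" to keep just the name.
county = parser.county_names[0].rsplit(' County', 1)[0]
print(county)
```

Unlike the split chain, this does not depend on where the link sits in the string, only on the link's href.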

edited Feb 02 '14 at 10:05

answered Feb 02 '14 at 09:59


+1 The only right answer when someone asks for an html regex is to tell them to stop using regex for html. – Slater Victoroff Feb 02 '14 at 10:10
