I need help writing a regular expression for a webpage to extract some data. The webpage is: http://www.city-data.com/city/Addison-Texas.html

I want to return "Dallas" from this bit of html code:

<a href="/county/Dallas_County-TX.html">Dallas County</a>
</p>
<b>Population in 2012:</b>

This is the regular expression I have written so far, but it does not seem to work. Any idea what I did wrong?

(">(.)/sCounty</a>\n</p>\n<b>Population in 2012:</b>")
stochasticcrap
  • Space isn't `/s` but `\s`. – devnull Feb 02 '14 at 09:48
  • I still receive the same error: Traceback (most recent call last): File "", line 1, in IndexError: list index out of range – stochasticcrap Feb 02 '14 at 09:49
  • Take one of the solutions in [this question](http://stackoverflow.com/q/11709079/18771). You don't want to use regular expressions on HTML, *because regular expressions are unable to parse HTML*. – Tomalak Feb 02 '14 at 09:59
  • The secret is to never use regex to parse html. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Slater Victoroff Feb 02 '14 at 10:08
  • `(.)` matches a single character. – Jerry Feb 02 '14 at 10:19

1 Answer

Another way to solve it, rather than using a regex, is to use the split function. Assuming s holds the HTML fragment from the question:

s.split('</a>')[0].split('>')[1].split(' ')[0]

should return the answer you intended.
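To see why the chained splits work, here is the same one-liner broken into steps. This is only a sketch: the variable `s` and the intermediate names are assumptions, with `s` set to the fragment quoted in the question.

```python
# Assumption: `s` holds the HTML fragment quoted in the question.
s = ('<a href="/county/Dallas_County-TX.html">Dallas County</a>\n'
     '</p>\n'
     '<b>Population in 2012:</b>')

# Step 1: keep everything before the closing </a> tag.
before_close = s.split('</a>')[0]      # '<a href="...">Dallas County'

# Step 2: the text after the first '>' is the link text.
link_text = before_close.split('>')[1]  # 'Dallas County'

# Step 3: the first word of the link text is the county name.
county = link_text.split(' ')[0]

print(county)
```

Note that this relies on the fragment starting at the county link; on the full page there would be many earlier `</a>` tags, so `s` would first have to be narrowed down to this region.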

However, the above method becomes tedious for more complex HTML. In that case you can use the HTMLParser module instead.
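A minimal sketch of the parser-based approach might look like the following. The class name `CountyLinkParser` and the rule "links whose href starts with /county/" are assumptions made for illustration, based only on the fragment shown in the question. The import is the Python 3 form; in Python 2 the module is named `HTMLParser`.

```python
from html.parser import HTMLParser  # Python 3; in Python 2: from HTMLParser import HTMLParser


class CountyLinkParser(HTMLParser):
    """Collects the text of <a> tags whose href starts with /county/ (hypothetical rule)."""

    def __init__(self):
        super().__init__()
        self.in_county_link = False
        self.link_texts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        href = dict(attrs).get('href') or ''
        if tag == 'a' and href.startswith('/county/'):
            self.in_county_link = True

    def handle_data(self, data):
        # Only record text that appears inside a county link.
        if self.in_county_link:
            self.link_texts.append(data.strip())

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_county_link = False


# Assumption: the fragment quoted in the question stands in for the page source.
html = ('<a href="/county/Dallas_County-TX.html">Dallas County</a>\n'
        '</p>\n'
        '<b>Population in 2012:</b>')

parser = CountyLinkParser()
parser.feed(html)

# Strip the trailing " County" to keep just the name.
county = parser.link_texts[0].rsplit(' County', 1)[0]
print(county)
```

Unlike the split chain, this does not depend on where the link sits in the document, only on the shape of its href.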

nitish712