I am having trouble figuring out how to select part of an HTML link using a regex.

Say the link is:

<a href="race?raceid=1234">Mushroom Cup</a>

I have figured out how to get the race id, but I cannot for the life of me figure out how to use a regular expression to find just 'Mushroom Cup'. The best I can do is get 1234>Mushroom Cup.

I'm new to regular expressions and it is just too much for me to comprehend.

alecxe
amchugh89
  • How much could the input vary? If you're extracting this data from several places in a large document, it might be worth using an HTML parser instead of regex. – Asad Saeeduddin Aug 19 '13 at 20:59

2 Answers

Something very much like this should work:

import re
re.findall(r'<a href="race\?raceid=(\d+)">([^<]+)</a>', html_text)
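
As a minimal runnable sketch of that pattern, assuming html_text holds just the sample link from the question (the variable names pattern, matches, race_id and name are illustrative), each match comes back as a (race id, link text) tuple:

```python
import re

# Sample input taken from the question; a real page would be much longer.
html_text = '<a href="race?raceid=1234">Mushroom Cup</a>'

# Group 1 captures the digits of raceid; group 2 captures the link text
# up to the next "<". The raw string keeps \? and \d intact for the regex engine.
pattern = re.compile(r'<a href="race\?raceid=(\d+)">([^<]+)</a>')

matches = pattern.findall(html_text)
for race_id, name in matches:
    print(race_id, name)
```

Because the pattern has two groups, findall returns a list of tuples rather than a list of strings, so the link text is the second element of each tuple.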
Joran Beasley

Don't ever use regex for parsing HTML. Instead, use an HTML parser such as lxml or BeautifulSoup.

Here's an example using BeautifulSoup:

# Python 3; in Python 2 these functions live in the urlparse module
from urllib.parse import urlparse, parse_qs
from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<html>
<head>
    <title>Python regex url grab - Stack Overflow</title>
</head>
<body>
    <a href="race?raceid=1234">Mushroom Cup</a>
</body>
</html>
""", "html.parser")

# Find the first <a> tag, then split its href query string into parameters
link = soup.find('a')
par = parse_qs(urlparse(link.attrs['href']).query)
print(par['raceid'][0])   # prints 1234
print(link.text)   # prints Mushroom Cup

Note that urlparse is used to get the value of the link's query parameter. See more here: Retrieving parameters from a URL.
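
To make the urlparse step concrete, here is a small sketch that works on the href string alone, with no HTML involved. It assumes Python 3, where these functions live in urllib.parse (in Python 2 they are in the urlparse module); the variable names are illustrative.

```python
from urllib.parse import urlparse, parse_qs

# The href value from the question's link.
href = "race?raceid=1234"

# urlparse splits the URL into components; .query is the part after "?".
query = urlparse(href).query

# parse_qs turns the query string into a dict mapping each name to a list of values.
params = parse_qs(query)

# A parameter can appear more than once, hence the list; take the first value.
race_id = params["raceid"][0]
print(race_id)
```

The value is a string, so it needs an int() call if a number is wanted.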

Hope that helps.

alecxe
  • oh...that seems nicer – amchugh89 Aug 19 '13 at 21:09
  • +1 since I agree in general that parsing HTML with a regex is a bad idea, but it would be nice to demonstrate why this solution may be superior to the simple regex for the OP's question. I know there are several reasons not to use regex (mainly that HTML is a nested language and regex doesn't handle nesting well, since it is stateless) – Joran Beasley Aug 19 '13 at 22:45
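
One way to illustrate the point raised in this comment is to run both approaches over a few variations of the same link. This is only a sketch: the variants list is hypothetical, and to keep it dependency-free it swaps in the standard library's html.parser.HTMLParser instead of BeautifulSoup. The LinkCollector class and parse_links helper are illustrative names, not part of either answer.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs

# Hypothetical variations: the original link, an extra attribute,
# single-quoted attribute values, and a nested tag inside the link text.
variants = [
    '<a href="race?raceid=1234">Mushroom Cup</a>',
    '<a class="cup" href="race?raceid=1234">Mushroom Cup</a>',
    "<a href='race?raceid=1234'>Mushroom Cup</a>",
    '<a href="race?raceid=1234"><b>Mushroom</b> Cup</a>',
]

pattern = re.compile(r'<a href="race\?raceid=(\d+)">([^<]+)</a>')


class LinkCollector(HTMLParser):
    """Collects (raceid, text) pairs for <a> tags whose href has a raceid parameter."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._race_id = None  # set while we are inside a matching <a> tag
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            ids = parse_qs(urlparse(href).query).get("raceid")
            if ids:
                self._race_id = ids[0]
                self._text = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still delivered here.
        if self._race_id is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._race_id is not None:
            self.links.append((self._race_id, "".join(self._text)))
            self._race_id = None


def parse_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


regex_results = [pattern.findall(v) for v in variants]
parser_results = [parse_links(v) for v in variants]

for variant, by_regex, by_parser in zip(variants, regex_results, parser_results):
    print(variant, by_regex, by_parser)
```

The regex only matches the first variant, because it hard-codes the attribute order, the quote style, and the absence of nested tags. The parser-based version returns the same pair for all four.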