Matching URL in HTML using regex

Question

It's been a while since I've used regex, and I feel like this should be simple to figure out.

I have a web page full of links that look like the string_to_match in the code below. I want to grab just the numbers in the links, like the number "58" in string_to_match. For the life of me I can't figure it out.

import re
string_to_match = '<a href="/ncf/teams/roster?teamId=58">Roster</a>'
re.findall('<a href="/ncf/teams/roster?teamId=(/d+)">Roster</a>',string_to_match)

Why, why, why do people keep trying to [parse HTML with regular expressions?!?](http://stackoverflow.com/a/1732454/364696) Use [an HTML parser](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). It can find the tags you care about with the expected attributes, pull it out for you, and actually [parse the URL](https://docs.python.org/3/library/urllib.parse.html) to get the `GET` parameters, which will be correct and largely self-documenting code. Even if the regex might be faster, unmaintainable and possibly wrong code is not an improvement. — ShadowRanger, Jan 19 '17 at 03:50
Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — MK., Jan 19 '17 at 04:15

alecxe · Accepted Answer · 2017-01-19T03:59:54.250

Instead of relying on regular expressions alone, you can combine two steps: HTML parsing (with the BeautifulSoup parser) to locate the desired link and extract its href attribute value, and then URL parsing to pull out the number. For the URL parsing step, this example uses a regular expression:

import re
from bs4 import BeautifulSoup

data = """
<body>
    <a href="/ncf/teams/roster?teamId=58">Roster</a>
</body>
"""

soup = BeautifulSoup(data, "html.parser")
link = soup.find("a", string="Roster")["href"]  # "string" replaces the older "text" argument

print(re.search(r"teamId=(\d+)", link).group(1))

Prints 58.
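
The URL parsing step can also be done without a regular expression, as the comment on the question suggests. Below is a minimal sketch, assuming the href value has already been extracted as above. It uses only the standard library's urllib.parse, and the helper name team_id_from_href is hypothetical, chosen for illustration:

```python
from urllib.parse import urlsplit, parse_qs


def team_id_from_href(href):
    """Return the teamId query parameter of href as a string, or None."""
    query = urlsplit(href).query      # e.g. "teamId=58"
    params = parse_qs(query)          # e.g. {"teamId": ["58"]}
    values = params.get("teamId")
    return values[0] if values else None


print(team_id_from_href("/ncf/teams/roster?teamId=58"))
```

This keeps working if the link later gains other query parameters or if their order changes, which a hand-written pattern may not.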

score 0 · Answer 2 · answered Jan 19 '17 at 04:11

I would recommend using BeautifulSoup or lxml; it's worth the learning curve.

...But if you still want to use a regexp:

re.findall(r'href="[^"]*teamId=(\d+)', string_to_match)
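
To see how this differs from the pattern in the question, here is a small sketch. In the question's pattern, the unescaped `?` makes the preceding `r` optional instead of matching a literal question mark, and `/d` matches a slash followed by the letter d rather than a digit. The second link in the sample HTML is made up for illustration:

```python
import re

# Sample HTML; the second link is a made-up example.
html = (
    '<a href="/ncf/teams/roster?teamId=58">Roster</a>\n'
    '<a href="/ncf/teams/roster?teamId=2132">Roster</a>'
)

# Pattern from the question: "?" acts as a quantifier and "/d" is not "\d".
broken = re.findall('<a href="/ncf/teams/roster?teamId=(/d+)">Roster</a>', html)

# Same pattern with the "?" escaped and "\d" used for digits.
fixed = re.findall(r'<a href="/ncf/teams/roster\?teamId=(\d+)">Roster</a>', html)

# Looser pattern from this answer, written as a raw string.
loose = re.findall(r'href="[^"]*teamId=(\d+)', html)

print(broken)  # []
print(fixed)   # ['58', '2132']
print(loose)   # ['58', '2132']
```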

xvan
