It's been a while since I've used regex, and I feel like this should be simple to figure out.

I have a web page full of links that look like `string_to_match` in the code below. I want to grab just the numbers in the links, such as the "58" in `string_to_match`. For the life of me I can't figure it out.

import re
string_to_match = '<a href="/ncf/teams/roster?teamId=58">Roster</a>'
re.findall('<a href="/ncf/teams/roster?teamId=(/d+)">Roster</a>',string_to_match)
user2859829
  • Why, why, why do people keep trying to [parse HTML with regular expressions?!?](http://stackoverflow.com/a/1732454/364696) Use [an HTML parser](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). It can find the tags you care about with the expected attributes, pull them out for you, and actually [parse the URL](https://docs.python.org/3/library/urllib.parse.html) to get the `GET` parameters, which will be correct and largely self-documenting code. Even if the regex might be faster, unmaintainable and possibly wrong code is not an improvement. – ShadowRanger Jan 19 '17 at 03:50
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – MK. Jan 19 '17 at 04:15

2 Answers

Instead of matching the whole tag with a regular expression, you can combine two steps: HTML parsing (using the BeautifulSoup parser) to locate the desired link and extract its href attribute value, and URL parsing to pull out the parameter. For the URL parsing step, in this case, we'll use a small regular expression:

import re
from bs4 import BeautifulSoup

data = """
<body>
    <a href="/ncf/teams/roster?teamId=58">Roster</a>
</body>
"""

soup = BeautifulSoup(data, "html.parser")
link = soup.find("a", text="Roster")["href"]

print(re.search(r"teamId=(\d+)", link).group(1))

Prints 58.
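
If you would rather avoid regular expressions for the URL step too, the standard library's `urllib.parse` (mentioned in the comments on the question) can read the query string directly. The following is only a sketch: `team_id_from_href` is a hypothetical helper name, and `link` is hard-coded here to the href value that the BeautifulSoup lookup above extracts.

```python
from urllib.parse import urlparse, parse_qs


def team_id_from_href(href):
    """Return the teamId query parameter of href as a string, or None if absent."""
    # urlparse splits off the query string ("teamId=58");
    # parse_qs turns it into a dict of lists, e.g. {"teamId": ["58"]}.
    params = parse_qs(urlparse(href).query)
    values = params.get("teamId")
    return values[0] if values else None


# Assumed to be the href value found by the HTML parser.
link = "/ncf/teams/roster?teamId=58"
print(team_id_from_href(link))  # 58
```

This keeps working if the link later gains extra parameters or the parameter order changes, which a pattern anchored on the exact URL text would not.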

alecxe

I would recommend using BeautifulSoup or lxml; it's worth the learning curve.

...But if you still want to use a regexp:

re.findall(r'href="[^"]*teamId=(\d+)', string_to_match)
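
For reference, the pattern in the question fails for two reasons: the unescaped `?` makes the preceding `r` optional instead of matching a literal question mark, and `/d+` matches a slash followed by the letter "d" rather than digits (that needs `\d+`). Here is a rough sketch comparing the question's pattern with those two fixes applied against the looser pattern above. The `page` string and its extra links are made up for illustration.

```python
import re

# Hypothetical page fragment; the extra links and teamId values are invented.
page = '''
<a href="/ncf/teams/roster?teamId=58">Roster</a>
<a href="/ncf/teams/roster?teamId=2132">Roster</a>
<a href="/ncf/teams/schedule?teamId=58">Schedule</a>
'''

# The question's pattern with both fixes: \? for a literal "?", \d+ for digits.
strict = re.findall(r'<a href="/ncf/teams/roster\?teamId=(\d+)">Roster</a>', page)

# The looser pattern: any href that contains teamId=<digits>.
loose = re.findall(r'href="[^"]*teamId=(\d+)', page)

print(strict)  # ['58', '2132']
print(loose)   # ['58', '2132', '58']
```

The strict pattern only picks up roster links, while the loose one also catches the schedule link, so which one fits depends on what else is on the page.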
xvan