Can't capture href tag content with regex first time

Question

I want to scrape a website for external links and paths by applying a regex to the href attribute of its HTML tags.

But I don't know whether there is a simpler way than my code:

import requests
import re

target_url = "http://testphp.vulnweb.com/"

response = requests.get(target_url)
# First pass: find every href="..." attribute in the page
res = re.findall(r'href="[\w.:/]+"', response.content.decode("utf-8"))

for i in res:
    # Second pass: isolate the quoted value
    patt = re.compile(r'"[.:/\w]+"')
    not_raw = re.findall(patt, i)
    # Third pass: strip the surrounding quotes
    raw = re.findall(r"[.:/\w]+", not_raw[0])
    print(raw)

Is there a way, instead of using regex three times, to pick the paths and links out of an href attribute without also capturing the attribute name and quotes? The res variable currently holds entries like this:

href="https://www.acunetix.com/vulnerability-scanner/"

Can I write the regex so that res holds only the URL, like the following?

https://www.acunetix.com/vulnerability-scanner/

Probably because scraping HTML with regex is generally discouraged. Use BeautifulSoup. — PaulMcG, Dec 29 '20 at 20:44

C14L · Accepted Answer · 2020-12-28T17:24:50.137

Yes, you can use "capturing" and "non-capturing" groups. Example:

re.findall(r'(?:href=")([^\"]+)(?:")', response.content.decode("utf-8"))

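The ?: in (?:href=") marks that group as non-capturing: it must still match, but re.findall returns only the text of the capturing group ([^\"]+), so the href=" prefix and the closing quote are left out of the result.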

From https://docs.python.org/3/library/re.html:

(?:...) A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
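
As a minimal sketch of the difference, here is the same idea run against a hard-coded string instead of the live page. The sample markup and the variable names are made up for illustration; only re.findall and the group syntax come from the answer above.

```python
import re

# Hypothetical sample markup standing in for response.content.decode("utf-8")
html = (
    '<a href="https://www.acunetix.com/vulnerability-scanner/">scanner</a> '
    '<a href="login.php">login</a>'
)

# No group in the pattern: findall returns each whole match, href="..." included
whole = re.findall(r'href="[^"]+"', html)

# One capturing group: findall returns only the text inside the parentheses
urls = re.findall(r'href="([^"]+)"', html)

print(whole)
print(urls)
```

Note that the non-capturing wrappers are optional here: with a single capturing group, plain literal text around it behaves the same way, so href="([^"]+)" should return the same list as the pattern in the answer.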

score 0 · Answer 2 · answered Feb 20 '21 at 09:24

Parsing HTML with regex is a poor choice. See this to learn why.

To get all the href attributes, use an HTML library like BeautifulSoup and try this:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://testphp.vulnweb.com/").content
# Collect every <a> tag that has an href attribute
anchors = BeautifulSoup(response, "html.parser").find_all("a", href=True)
# Keep only the hrefs that contain "http", i.e. the absolute links
href_ = [a["href"] for a in anchors if "http" in a["href"]]
print("\n".join(href_))

Output:

https://www.acunetix.com/
https://www.acunetix.com/vulnerability-scanner/
http://www.acunetix.com
https://www.acunetix.com/vulnerability-scanner/php-security-scanner/
https://www.acunetix.com/blog/articles/prevent-sql-injection-vulnerabilities-in-php-applications/
http://www.eclectasy.com/Fractal-Explorer/index.html
http://www.acunetix.com
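
If installing BeautifulSoup is not an option, a similar result can probably be had with the standard library alone. The sketch below is not part of either answer: it swaps in html.parser.HTMLParser, and the class name HrefCollector, the sample string, and the rule "a link is external if its scheme is http or https" are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


# Hypothetical sample markup; in the question this would be the decoded response body
sample = (
    '<a href="https://www.acunetix.com/">site</a>'
    '<a href="login.php">login</a>'
    '<a name="top">top</a>'
)

collector = HrefCollector()
collector.feed(sample)

# Assumption: treat hrefs with an http/https scheme as external links, the rest as paths
external = [h for h in collector.hrefs if urlparse(h).scheme in ("http", "https")]
paths = [h for h in collector.hrefs if h not in external]

print(external)
print(paths)
```

Checking the parsed scheme is a little stricter than testing whether "http" appears anywhere in the href, which would also match a relative path that merely contains those letters.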
