I want to scrape a website for external links and paths by applying a regex to the
href attribute of its HTML tags.
My code works, but I don't know whether there is a simpler way to do it:
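import re
import requests

target_url = "http://testphp.vulnweb.com/"
response = requests.get(target_url)
# First regex: collect every complete href="..." attribute
res = re.findall(r'href="[\w.:/]+"', response.content.decode("utf-8"))
# Second regex: matches the quoted value inside one attribute
patt = re.compile(r'"[.:/\w]+"')
for i in res:
    not_raw = re.findall(patt, i)
    # Third regex: strips the surrounding quotes
    raw = re.findall(r"[.:/\w]+", not_raw[0])
    print(raw)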
Is there a way, instead of using a regex three times, to pick out only the path or
link from an href attribute, without also capturing the href= prefix and the quotes?
Each item in the res variable currently looks like this:
href="https://www.acunetix.com/vulnerability-scanner/"
Can I write the regex so that each item in the res variable contains only the URL,
like the following?
https://www.acunetix.com/vulnerability-scanner/
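To make the goal concrete, here is a minimal sketch of the behaviour I am hoping for, assuming a single capture group is the right tool. The names sample_html, HREF_VALUE and extract_hrefs are made up for illustration, and the sample markup stands in for the real response so the snippet runs offline. I also added a hyphen to the character class, because the example URL contains one.

```python
import re

# Hypothetical sample markup standing in for the decoded response body.
sample_html = (
    '<a href="https://www.acunetix.com/vulnerability-scanner/">scanner</a> '
    '<a href="index.php">home</a>'
)

# One capture group around the value: re.findall then returns only the
# group's text, not the surrounding href="..." part.
# Assumption: "-" is added to the class so hyphenated URLs match.
HREF_VALUE = re.compile(r'href="([\w.:/-]+)"')


def extract_hrefs(html):
    """Return the bare href values found in html, in document order."""
    return HREF_VALUE.findall(html)


links = extract_hrefs(sample_html)
for link in links:
    print(link)
```

As far as I can tell from the re documentation, findall returns the group contents when the pattern has exactly one group, so a single pass like this might replace all three of my regex calls.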