
I want to scrape a website for external links and paths by applying a regex to the href attribute of its HTML tags.

But I don't know whether there is a simpler way than my code:

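import re

import requests

target_url = "http://testphp.vulnweb.com/"

response = requests.get(target_url)
# Step 1: grab every href="..." attribute, including the attribute name
res = re.findall(r'href="[\w.:/]+"', response.content.decode("utf-8"))

# Compile the second pattern once, outside the loop
patt = re.compile(r'"[.:/\w]+"')
for i in res:
    not_raw = re.findall(patt, i)  # Step 2: the quoted value
    raw = re.findall(r"[.:/\w]+", not_raw[0])  # Step 3: strip the quotes
    print(raw)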

Is there a way, instead of using regex three times, to pick out the paths and links from an href attribute without also capturing the href= part and the quotes? I mean, each item in the res variable looks like this:

href="https://www.acunetix.com/vulnerability-scanner/"

Can I write the regex so that res contains only the URL, like the following?

https://www.acunetix.com/vulnerability-scanner/
Alireza

2 Answers


Yes, you can use "capturing" and "non-capturing" groups. Example:

re.findall(r'(?:href=")([^\"]+)(?:")', response.content.decode("utf-8"))

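The ?: in (?:href=") makes that group non-capturing. When a pattern contains a capturing group, re.findall returns only the text matched by that group, here ([^\"]+), so the href=" prefix and the closing quote are matched but not returned. The non-capturing groups are optional in this case: r'href="([^"]+)"' returns the same result.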

From https://docs.python.org/3/library/re.html:

(?:...) A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
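
As a minimal sketch of the difference between the whole match and the captured group, the snippet below runs the simplified pattern against a small sample string. The sample_html value and the HREF_PATTERN name are made up for illustration; in the question's code the input would be the decoded response body.

```python
import re

# Hypothetical sample markup standing in for the decoded response body
sample_html = (
    '<a href="https://www.acunetix.com/vulnerability-scanner/">Scanner</a>'
    '<a href="artists.php">Artists</a>'
)

# One capturing group: findall returns only what the group matched
HREF_PATTERN = re.compile(r'href="([^"]+)"')

links = HREF_PATTERN.findall(sample_html)
print(links)

# finditer exposes both the full match and the group
for match in HREF_PATTERN.finditer(sample_html):
    print("whole match:", match.group(0))
    print("group 1:    ", match.group(1))
```

Here match.group(0) still includes href=" and the closing quote, while match.group(1) and the findall result contain only the URL or path.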

C14L

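Parsing HTML with regex is a poor choice: a pattern such as href="..." misses attribute values written with single quotes or with whitespace around the equals sign, and it also matches markup that sits inside comments.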
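
A small sketch of how this goes wrong, using a few made-up HTML snippets (the samples list is hypothetical, not taken from the target site):

```python
import re

pattern = re.compile(r'href="([^"]+)"')

# Hypothetical snippets that are all valid HTML
samples = [
    '<a href="a.html">double quotes</a>',
    "<a href='b.html'>single quotes</a>",
    '<a href = "c.html">spaces around the equals sign</a>',
    '<!-- <a href="d.html">commented out</a> -->',
]

results = [pattern.findall(sample) for sample in samples]
for sample, found in zip(samples, results):
    print(found, "<-", sample)
```

The pattern finds the first link, misses the second and third, and reports the fourth even though a browser would never render it.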

To get all the href attributes, use an HTML library such as BeautifulSoup and try this:

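import requests
from bs4 import BeautifulSoup

response = requests.get("http://testphp.vulnweb.com/")
soup = BeautifulSoup(response.content, "html.parser")
# Keep only <a> tags that actually have an href attribute
anchors = soup.find_all("a", href=True)
# Absolute links start with an http:// or https:// scheme
hrefs = [a["href"] for a in anchors if a["href"].startswith(("http://", "https://"))]
print("\n".join(hrefs))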

Output:

https://www.acunetix.com/
https://www.acunetix.com/vulnerability-scanner/
http://www.acunetix.com
https://www.acunetix.com/vulnerability-scanner/php-security-scanner/
https://www.acunetix.com/blog/articles/prevent-sql-injection-vulnerabilities-in-php-applications/
http://www.eclectasy.com/Fractal-Explorer/index.html
http://www.acunetix.com
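
Since the question asks for both external links and paths, here is a rough sketch of one way to separate the two. It uses only the standard library (html.parser and urllib.parse) instead of BeautifulSoup so that it runs without extra packages. HrefCollector, split_links, and sample_html are hypothetical names and data introduced for illustration, and "external" is assumed to mean "hosted on a different domain than the base URL".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_links(html, base_url):
    """Return (external_links, local_paths) for the anchors in html."""
    collector = HrefCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    external, local = [], []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative hrefs
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parsed.netloc == base_host:
            local.append(parsed.path)
        else:
            external.append(absolute)
    return external, local


# Made-up markup; in practice this would be the downloaded page text
sample_html = """
<a href="https://www.acunetix.com/">Acunetix</a>
<a href='artists.php'>Artists</a>
<a href="/cart.php">Cart</a>
<a href="mailto:someone@example.com">Mail</a>
"""

external, local = split_links(sample_html, "http://testphp.vulnweb.com/")
print(external)
print(local)
```

Resolving every href with urljoin first means relative paths and absolute URLs go through the same host comparison, so a full URL that points back to the same site is still treated as a local path.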
baduker