
Find all the URL links in an HTML text using regex. Arguments: the text below is assigned to the html variable.

html = """
<a href="#fragment-only">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>
"""

The output should look like this:

["/relative/path","//other.host/same-protocol","https://example.com"]

The function should ignore fragment identifiers (link targets that begin with #). That is, if the URL points to a specific fragment/section using the hash symbol, the fragment part (everything from the # onward) should be stripped before the URL is returned by the function. A link that consists only of a fragment, such as #fragment-only, is therefore left out of the result entirely.

I tried the code below, but it does not work; it only gives the output ["https://example.com"]:

 urls = re.findall('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', html)
 print(urls)

2 Answers


You could try using a positive lookbehind to find the quoted strings that follow href= in the HTML:

import re

pattern = re.compile(r'(?<=href=\")(?!#)(.+?)(?=#|\")')
urls = re.findall(pattern, html)

See this answer for more on how matching only up to the '#' character works, and here if you want a breakdown of the RegEx overall
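For readers who want the breakdown inline, here is a minimal sketch of the same pattern written with re.VERBOSE so each piece can carry a comment. The name HREF_PATTERN is only illustrative, and the sketch assumes every href value is wrapped in double quotes, as in the question's sample.

import re

html = """
<a href="#fragment-only">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>
"""

# Same pattern as above, spelled out piece by piece.
# In verbose mode a bare # starts a comment, so literal hashes are escaped.
HREF_PATTERN = re.compile(
    r"""
    (?<=href=")   # lookbehind: the match must directly follow href="
    (?!\#)        # negative lookahead: skip values that begin with #
    (.+?)         # lazily capture the link itself
    (?=\#|")      # lookahead: stop before a fragment (#) or the closing quote
    """,
    re.VERBOSE,
)

urls = HREF_PATTERN.findall(html)
print(urls)
# ['/relative/path', '//other.host/same-protocol', 'https://example.com']

Because the pattern has a single capturing group, findall returns the captured links as a flat list of strings.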

JRiggles
  • How do I remove the paths that start with a hash (#)? The output should be something like ["/relative/path","//other.host/same-protocol","https://example.com"], but yours gives ['#fragment-only', '/relative/path#fragment', '//other.host/same-protocol', 'https://example.com']. – Mahamodul Shakil Oct 20 '22 at 20:52
  • I've updated the RegEx - it should trap all of your required cases now! – JRiggles Oct 20 '22 at 22:37
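from typing import List, Optional

html = """
<a href="#fragment-only">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>
"""

href_prefix = "href=\""


def get_links_from_html(html: str, result: Optional[List[str]] = None) -> List[str]:
    # Create the accumulator on the first call; recursive calls share it.
    if result is None:
        result = []

    # partition() returns (before, separator, after); the separator is
    # empty when no further href=" occurs in the remaining text.
    _, separator, rest = html.partition(href_prefix)

    if not separator:
        return result

    # Take everything up to the closing quote, then drop any #fragment.
    link = rest.partition("\"")[0].partition("#")[0]

    # Fragment-only links become empty strings and are skipped.
    if link:
        result.append(link)
    return get_links_from_html(rest, result)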


print(get_links_from_html(html))
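
If the markup may use single quotes or extra whitespace around href, string splitting gets fragile. As a hedged alternative (a different technique from the one above, and not regex either), here is a minimal sketch built on the standard library's html.parser. The names LinkCollector and extract_links are hypothetical, chosen just for this illustration.

from html.parser import HTMLParser
from typing import List

html = """
<a href="#fragment-only">anchor link</a>
<a id="some-id" href="/relative/path#fragment">relative link</a>
<a href="//other.host/same-protocol">same-protocol link</a>
<a href="https://example.com">absolute URL</a>
"""


class LinkCollector(HTMLParser):
    """Collects href values with their fragments stripped (illustrative helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name == "href" and value:
                link = value.partition("#")[0]  # drop the fragment part
                if link:                        # skip fragment-only links
                    self.links.append(link)


def extract_links(text: str) -> List[str]:
    parser = LinkCollector()
    parser.feed(text)
    parser.close()
    return parser.links


print(extract_links(html))
# ['/relative/path', '//other.host/same-protocol', 'https://example.com']

This also avoids recursion, so a page with a very large number of links cannot run into Python's recursion limit.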
syscloud