
I am trying to fetch non-HTTP(S) URLs from anchor tags. I need to match the entire anchor tag if such a URL is found.

Example:

This should match: <a href="example.com/index.html"> bla</a>

This shouldn't match: <a href="https://www.google.com/">bla2 </a>

I have been able to build this regex so far:

(\<a[\s\S]*?)(?<=href)(?:(=[\"\'])|(=))(?!(http[s]?)|(ww[w]?)|(#)|(\/\/))
(?P<url>[\S]*?)(?=([\"\'])|(\s))([\s\S]*?\>)

But this also matches the anchor with the HTTP URL.

With this regex: (?<=href=[\"\'])(?!(http[s]?)|(ww[w]?))(?P<url>[\S]+)(?=[\"\']) I am able to get only the non-HTTP URL, but I need the entire <a> tag to be matched, too.

Any suggestions would be great, including further improvements to the regex. PS: I cannot use BeautifulSoup, so please suggest a better regex for my problem.

  • You can't parse HTML with regex: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – DZDomi Jun 01 '18 at 10:34
  • Don't use regex, use an HTML parser. – Daniel Jun 01 '18 at 10:35
  • I know it isn't a great idea to use regex for HTML, but I am constrained to use regex here. An HTML parser wouldn't help me much. Any suggestion on modifying the regex to get what I need would be very helpful. – Akash Sundaresh Jun 01 '18 at 11:51

1 Answer

This might work:

(<a[^>]*href=[\"\'](?!http|ww)(?:\S+)[\"\'][^>]*>)

This will match the opening tag <a href="example.com/index.html">. If you need everything up to </a>, add e.g. .*?</\s*a> before the closing parenthesis.
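As a rough sketch of how the extended pattern could be used from Python: the names ANCHOR_RE, find_non_http_anchors and sample are made up for illustration, and the re.DOTALL flag is an assumption added so that link text spanning several lines is still matched.

```python
import re

# The pattern from above with .*?</\s*a> added before the closing parenthesis.
# re.DOTALL is an assumption: it lets .*? cross newlines inside the link text.
ANCHOR_RE = re.compile(
    r"""(<a[^>]*href=["'](?!http|ww)(?:\S+)["'][^>]*>.*?</\s*a>)""",
    re.DOTALL,
)


def find_non_http_anchors(html):
    """Return each full <a>...</a> whose href does not start with http or ww."""
    return ANCHOR_RE.findall(html)


# The two examples from the question.
sample = (
    '<p>This should match: <a href="example.com/index.html"> bla</a></p>\n'
    '<p>This shouldn\'t match: <a href="https://www.google.com/">bla2 </a></p>'
)

print(find_non_http_anchors(sample))
# ['<a href="example.com/index.html"> bla</a>']
```

Note that the check is case-sensitive, so an href starting with HTTP:// would still be matched unless re.IGNORECASE is added.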

Explanation

  • (?!http|ww): negative lookahead. Writing https? is unnecessary here because (?!http) already rejects both http and https (the same goes for ww and www).
  • (?:\S+): the URL. This could be improved, since many characters aren't allowed in URLs, but it is sufficient for the moment.
  • [^>]*: the <a> tag might contain other attributes before or after href.
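To see the lookahead on its own, here is a small sketch that applies only the (?!http|ww)\S+ part to a few hrefs. The helper name passes_lookahead and the test URLs are hypothetical and exist only for this illustration.

```python
import re

# Only the lookahead and URL part of the pattern; match() tries position 0 only.
NOT_HTTP = re.compile(r"(?!http|ww)\S+")


def passes_lookahead(url):
    """True if the URL does not begin with 'http' or 'ww'."""
    return NOT_HTTP.match(url) is not None


for url in [
    "http://a.example",
    "https://a.example",
    "www.a.example",
    "example.com/index.html",
    "wwf/panda.html",
]:
    print(url, passes_lookahead(url))
```

The last case shows a side effect of the prefix check: a relative URL that merely starts with ww (or http) is rejected as well. If that matters, a stricter lookahead such as (?!https?://|www\.) may be a better fit.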
Snow bunting