I am trying to extract non-http(s) URLs from anchor tags. When such a URL is found, I need the match to cover the entire anchor tag.
Example:
This should match: <a href="example.com/index.html"> bla</a>
This shouldn't match: <a href="https://www.google.com/">bla2 </a>
I have been able to build this regex so far:
(\<a[\s\S]*?)(?<=href)(?:(=[\"\'])|(=))(?!(http[s]?)|(ww[w]?)|(#)|(\/\/))(?P<url>[\S]*?)(?=([\"\'])|(\s))([\s\S]*?\>)
But this still gives me a match for the anchor with the https URL.
With this regex: (?<=href=[\"\'])(?!(http[s]?)|(ww[w]?))(?P<url>[\S]+)(?=[\"\'])
I am able to get only the non-http URL, but I need the entire <a> tag to be matched, too.
Any suggestions would be great, and I am happy for this to be improved further. PS: I cannot use BeautifulSoup, so please suggest a better regex for my problem.
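
To make the expected behaviour concrete, here is a rough Python sketch of one possible direction, not a definitive solution. The names ANCHOR_RE and find_non_http_anchors are made up for illustration. The excluded prefixes (http:/https:, www., # and //) are taken from the lookahead above but tightened slightly, which is an assumption on my part. It also assumes the anchor has a closing </a> and that attribute values contain no ">" character.

```python
import re

# Hypothetical pattern: matches a whole <a ...>...</a> element only when its
# href does not start with one of the excluded prefixes.
ANCHOR_RE = re.compile(
    r"""
    <a\s+                      # start of the anchor tag
    (?:[^>]*?\s)?              # any attributes that come before href
    href\s*=\s*
    (["']?)                    # optional opening quote, remembered as group 1
    (?!https?:|www\.|\#|//)    # assumed exclusions: http(s), www., fragment, protocol-relative
    (?P<url>[^"'\s>]+)         # the URL: no quotes, no whitespace, no '>'
    \1                         # the same quote (or nothing) must close the value
    [^>]*>                     # remainder of the opening tag
    .*?</a>                    # tag content up to the first closing tag
    """,
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)


def find_non_http_anchors(html):
    """Return (full_tag, url) pairs for anchors with a non-http(s) href."""
    return [(m.group(0), m.group("url")) for m in ANCHOR_RE.finditer(html)]


sample = (
    '<a href="example.com/index.html"> bla</a>\n'
    '<a href="https://www.google.com/">bla2 </a>'
)
for tag, url in find_non_http_anchors(sample):
    print(url, "->", tag)
```

The idea is that the opening quote is captured once and the URL character class excludes quotes, so the engine cannot give up the quote, re-test the lookahead one character earlier, and slip past it. Like any regex over HTML, this is fragile for nested anchors or unusual markup.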