Python regex to get non http(s) urls from tag from html content

Question

I am trying to fetch non-http(s) URLs from anchor tags. I need to match the entire anchor tag if such a URL is found.

Example :

This should match: <a href="example.com/index.html"> bla</a>

This shouldn't match: <a href="https://www.google.com/">bla2 </a>

I have been able to build this regex so far:

(\<a[\s\S]*?)(?<=href)(?:(=[\"\'])|(=))(?!(http[s]?)|(ww[w]?)|(#)|(\/\/))
(?P<url>[\S]*?)(?=([\"\'])|(\s))([\s\S]*?\>)

But this gives me a match even for the one with HTTP.

With this regex: (?<=href=[\"\'])(?!(http[s]?)|(ww[w]?))(?P<url>[\S]+)(?=[\"\']) I am able to get only the non-http URL, but I need the entire content of the <a> tag to be matched, too.

Any suggestions would be great. Happy if this can be further improved. PS: I cannot use BeautifulSoup, so please suggest a better regex for my problem.

You can't parse HTML with regex: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – DZDomi, Jun 01 '18 at 10:34
I know it isn't a great idea to use regex for HTML, but I am constrained to use regex here; an HTML parser wouldn't help me much. Any suggestion on modifying the regex to get what I need would be very helpful. – Akash Sundaresh, Jun 01 '18 at 11:51

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

This might work:

(<a[^>]*href=[\"\'](?!http|ww)(?:\S+)[\"\'][^>]*>)

This will match <a href="example.com/index.html">. If you need everything up to </a>, add e.g. .*?</\s*a> before the closing parenthesis.
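As a rough check, the pattern above (extended with .*?</\s*a> as described) can be run against the two examples from the question. This is only a sketch: the name ANCHOR_RE, the sample string, and the re.DOTALL flag are assumptions added for illustration and are not part of the answer.

```python
import re

# The answer's pattern, with .*?</\s*a> added before the closing parenthesis
# so that the match runs through the closing </a>.
# re.DOTALL is an assumption: it lets the link text span several lines.
ANCHOR_RE = re.compile(
    r"""(<a[^>]*href=["'](?!http|ww)(?:\S+)["'][^>]*>.*?</\s*a>)""",
    re.DOTALL,
)

# The two example anchors from the question.
html = (
    '<a href="example.com/index.html"> bla</a>\n'
    '<a href="https://www.google.com/">bla2 </a>'
)

matches = ANCHOR_RE.findall(html)
print(matches)  # ['<a href="example.com/index.html"> bla</a>']
```

Only the first anchor is returned; the second is skipped because the lookahead fails right after the opening quote.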

Explanation

(?!http|ww): negative lookahead. Writing https? is unnecessary here because (?!http) already rejects both http and https (likewise, ww covers www).
(?:\S+): the URL. This could be tightened, since many characters aren't allowed in URLs, but it is sufficient for the moment.
[^>]*: the <a> tag might contain other attributes.
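The prefix behaviour of the negative lookahead can be checked in isolation. The sketch below is illustrative only: the name prefix_check and the sample URLs are made up for the demonstration. It also shows a side effect worth knowing about: any URL that merely begins with the letters http or ww is rejected too.

```python
import re

# (?!http|ww) fails whenever the text at the current position starts with
# "http" or "ww", so "https" and "www" are rejected as well.
prefix_check = re.compile(r"(?!http|ww)\S+")

# Hypothetical sample URLs, chosen only for this demonstration.
samples = [
    "http://a.example",
    "https://a.example",
    "www.a.example",
    "example.com/index.html",
    "httpdocs/index.html",  # relative path that happens to start with "http"
]

# re.match only tries the start of the string, which mirrors the lookahead
# sitting directly after the opening quote in the full pattern.
results = [bool(prefix_check.match(url)) for url in samples]
for url, ok in zip(samples, results):
    print(url, ok)
```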

answered Jun 02 '18 at 13:32 by Snow bunting · edited Jun 20 '20 at 09:12 by Community

Thanks. This almost works, but it wouldn't match non-quoted URLs. Is there any way I could match these too? – Akash Sundaresh Jun 04 '18 at 05:49
Maybe `(<a[^>]*href=[\"\']?(?!http|ww)(?:\S+?)[\"\']?[^>]*>)`. Also change the `\S+?` to a pattern that matches a URL well enough. – Snow bunting Jun 04 '18 at 15:32
Thanks for your time. This does match unquoted URLs, but it also matches URLs starting with http/www even though we have written negative lookaheads. This has been the case for me too: if I try to match unquoted URLs, https gets matched as well, and if I try to exclude https, the unquoted URLs get missed too. – Akash Sundaresh Jun 06 '18 at 05:47
yeah sry, wasn't thinking. How about `(?:h(?!ttp)|w(?!w)|[^wh])\S*` instead of `(?:\S+?)`? This would match strings that do not start with `http` or `ww`. And then the optional quotes and stuff around as above. – Snow bunting Jun 06 '18 at 10:17
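One possible way to assemble the last suggestion is sketched below. It is not the commenter's exact pattern: the first-character class is tightened from `[^wh]` to `[^wh"'\s>]`, and the rest of the URL is limited to non-quote, non-space, non-`>` characters. Both changes are assumptions made here. Without them, the regex engine can skip the optional opening quote and let `[^wh]` consume the quote itself, which is how quoted http URLs slip through. The name NON_HTTP_ANCHOR and the sample tags are hypothetical.

```python
import re

# Sketch only: matches the opening <a ...> tag when href does not start
# with "http" or "ww", with or without quotes around the URL.
NON_HTTP_ANCHOR = re.compile(
    r"""<a[^>]*href=["']?"""               # tag start, href=, optional opening quote
    r"""(?:h(?!ttp)|w(?!w)|[^wh"'\s>])"""  # first URL char; quotes excluded (assumption)
    r"""[^\s"'>]*"""                       # rest of the URL (assumption: no quotes/spaces)
    r"""["']?[^>]*>"""                     # optional closing quote, other attributes
)

# Hypothetical sample tags covering quoted and unquoted forms.
samples = [
    '<a href="example.com/index.html">',
    '<a href=example.com/index.html>',
    '<a href="https://www.google.com/">',
    '<a href=https://www.google.com/>',
    "<a href='www.example.com'>",
    '<a href="help/faq.html">',
]

results = [bool(NON_HTTP_ANCHOR.search(tag)) for tag in samples]
for tag, ok in zip(samples, results):
    print(ok, tag)
```

URLs that start with h or w but not with http or ww, such as help/faq.html, are still accepted, which is the point of the `h(?!ttp)|w(?!w)` alternation.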
