Regex: match the second pattern only where it is not inside a match of the first pattern

Question

For example, I have this text:

www.google.com

<a href="www.google.com"> Google Homepage </a>

I wrote `(<a.*<\/a>)`, which captures the anchor tag, and `(www\.[\S]+(\b|$))`, which selects any text that starts with `www.`. What I want is to select only the standalone www.google.com, not the one inside the anchor tag.

Is there any way to ignore the anchor tag completely and select text only from the remaining text?

To be more precise, I need a regex that expresses: NOT `(<a.*<\/a>)` AND `(www\.[\S]+(\b|$))`

I hope my question is clear. Thanks for helping.

[Do not try to parse html with regex](https://stackoverflow.com/a/1732454/6320039) — Ulysse BN, Jun 29 '18 at 07:29

Lars-Olof Kreim · Accepted Answer · 2018-06-29T09:31:23.647

As I understand it, you want to select each URL (starting with `www.`) when it is not in the href attribute.

This works with a negative lookbehind:

(?<!href=")(www\.[\S]+(\b|$))

This regex selects the URL only when it is not immediately preceded by href=".

Be aware that negative lookbehind is not supported by every JavaScript engine (it was only added to the language in ES2018). Tested on https://regex101.com/

Edit due to additions in the comments: if you want to exclude everything inside an HTML tag (between < and the closing >), this should work for you:
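A minimal sketch of this lookbehind in Python, run against the text from the question. Python is used here only because its `re` module supports lookbehind; the helper name `find_bare_urls` is hypothetical.

```python
import re

# Sample text from the question: one bare URL and one inside an href attribute.
text = 'www.google.com\n\n<a href="www.google.com"> Google Homepage </a>'

# Negative lookbehind: reject a match that is immediately preceded by href="
pattern = re.compile(r'(?<!href=")(www\.[\S]+(\b|$))')

def find_bare_urls(s):
    """Return every www. URL that is not directly preceded by href=" (hypothetical helper)."""
    return [m.group(1) for m in pattern.finditer(s)]

print(find_bare_urls(text))  # ['www.google.com']
```

Note that the lookbehind only checks the characters directly in front of the match, so a variant such as `href='www...'` (single quotes) would not be excluded by this pattern.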

(?![^<]*>)(([a-zA-Z0-9\-\_\.])+@[a-zA-Z\_]+?(\.[a-zA-Z]{2,6})+)

It is a negative lookahead: the match is rejected when the current position is followed by any number of characters that are not < and then a >, which is the case inside a tag.

The good thing about negative lookahead is that it is supported in JS :)
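As a rough illustration, the same lookahead prefix can be put in front of either pattern from this thread. The sketch below is in Python. The constant name `NOT_IN_TAG` and the `mailto:` sample markup are assumptions made for the example, not taken from the original posts.

```python
import re

# Reject any position that is followed by non-'<' characters and then '>',
# i.e. any position that sits inside a tag such as <a href="...">.
NOT_IN_TAG = r'(?![^<]*>)'  # hypothetical name for the shared prefix

url_pattern = re.compile(NOT_IN_TAG + r'(www\.[\S]+(\b|$))')
email_pattern = re.compile(
    NOT_IN_TAG + r'(([a-zA-Z0-9\-\_\.])+@[a-zA-Z\_]+?(\.[a-zA-Z]{2,6})+)'
)

# Text from the question.
url_text = 'www.google.com\n\n<a href="www.google.com"> Google Homepage </a>'
# Hypothetical sample; the mailto: markup is an assumption.
email_text = ('sandeep.gupta@xyz.com '
              '<a href="mailto:sandeep.gupta@xyz.com"> Sandeep Gupta </a>')

urls = [m.group(1) for m in url_pattern.finditer(url_text)]
emails = [m.group(1) for m in email_pattern.finditer(email_text)]

print(urls)    # ['www.google.com']
print(emails)  # ['sandeep.gupta@xyz.com']
```

The lookahead only detects positions inside a tag itself. Link text between the opening `<a ...>` and the closing `</a>` is outside any tag, so this prefix alone does not exclude it.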

edited Jun 29 '18 at 09:31

answered Jun 29 '18 at 07:33

Lars-Olof Kreim

thanks for helping, any suggestions for this `sandeep.gupta@xyz.com` ` Sandeep Gupta ` – Sandeep Gupta Jun 29 '18 at 08:16
I'm trying this regex - `(?<!mailto:)(([a-zA-Z0-9\-\_\.])+@[a-zA-Z\_]+?(\.[a-zA-Z]{2,6})+)` – Sandeep Gupta Jun 29 '18 at 08:17
The problem here is that this will match the mail address starting from the second character – Lars-Olof Kreim Jun 29 '18 at 08:20
can you please explain :/ how this works - `(?![^<]*>)` btw thanks for helping. – Sandeep Gupta Jun 29 '18 at 08:45
This is a negative lookahead, so it checks what comes after the current position. `[^<]*>` means any number of characters that are *not* `<`, followed by a `>` – Lars-Olof Kreim Jun 29 '18 at 09:29
One more doubt: `(?![^<]*>)([^\/])(www\.[\S]+(\b|$))` is what I used to select URLs starting with `www` but not with `http://www`. It works fine but includes a whitespace character in front of the selected text. What would be the best solution to this problem? – Sandeep Gupta Jun 29 '18 at 09:37
What about `(?![^<]*>)(?<!\/)(www\.[\S]+(\b|$))`, adding a negative lookbehind for the `/`? Be aware that negative lookbehind is not widely supported in JS yet; it was only added to the ES definition this year, so this solution is not safe to rely on in JS at the moment – Lars-Olof Kreim Jun 29 '18 at 09:45
I could use negative lookbehind, but I don't want to use it here because of this [link](https://stackoverflow.com/questions/641407/javascript-negative-lookbehind-equivalent) and `(?![^<]*>)(?<!\/)(www\.[\S]+(\b|$))` – Sandeep Gupta Jun 29 '18 at 09:45
`(?![^<]*>)([^\/])(www\.[\S]+(\b|$))` why does this include an extra space? – Sandeep Gupta Jun 29 '18 at 09:58
what if I want only anchor tag to be considered while doing this? `(?![^)(^|[^\/])(www\.[\S]+(\b|$))` i tried this – Sandeep Gupta Jul 01 '18 at 04:59
Your example `(?![^<]*>)([^\/])(www\.[\S]+(\b|$))` includes an extra space because `([^\/])` matches exactly one character that is *not* `/`, so it consumes whatever character (other than `/`) precedes the URL – Lars-Olof Kreim Jul 02 '18 at 11:40
`(?![^)(^|[^\/])(www\.[\S]+(\b|$))` the problem here is that `[]` forms a set of characters, not a specific sequence of characters, you basically can not search for ` – Lars-Olof Kreim Jul 02 '18 at 11:42
To learn regex there are several resources, e.g. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions or https://github.com/zeeshanu/learn-regex or https://regexone.com/ – Lars-Olof Kreim Jul 02 '18 at 11:43
Is there any way to do this? I'm using `(?!([^\<]+)<\/a>)(\b(https?|ftp):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])` but it fails when there is an angle bracket inside the anchor tag. Any help ASAP! – Sandeep Gupta Jul 03 '18 at 10:42
search for this particular tag only, starting from `` – Sandeep Gupta Jul 03 '18 at 10:43
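The extra-character issue discussed in the comments can be reproduced with a short Python sketch. It compares the consuming group `([^\/])` with the `(^|[^\/])` variant and reads the URL from capture group 2 instead of the whole match, which avoids lookbehind. The sample text and variable names are hypothetical.

```python
import re

# Hypothetical sample: two bare URLs and one preceded by http://
text = 'www.google.com and www.example.org but not http://www.skip.me'

# ([^\/]) must consume one character before the URL, so the whole match
# includes that character and a URL at the very start of the text is missed.
consuming = re.compile(r'(?![^<]*>)([^\/])(www\.[\S]+(\b|$))')

# (^|[^\/]) also accepts the start of the text; the URL itself is group 2.
with_start = re.compile(r'(?![^<]*>)(^|[^\/])(www\.[\S]+(\b|$))')

whole_matches = [m.group(0) for m in consuming.finditer(text)]
urls_only = [m.group(2) for m in with_start.finditer(text)]

print(whole_matches)  # [' www.example.org']
print(urls_only)      # ['www.google.com', 'www.example.org']
```

Reading group 2 keeps the preceding character out of the result without needing `(?<!\/)`, so the same idea should carry over to engines without lookbehind support.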
