How can one search for some text which is not part of a url?

Question

Suppose the text to search for is pqr.

"http://abc.zzz/pqr/xyz"      -> Should not match
"/pqr/"                       -> Should Match
"pqr"                         -> Should Match
"http://abc.zzz/pqr/pqr/"     -> Should not match
"http://abc.zzz/pqr/pqr/ pqr" -> Should match the last "pqr"
"www.pqr.zzz"                 -> Should not match

I tried using the following regex:

((?:(?:(?:https?|ftp|file|mailto):)|www)[^ ]+?)?(pqr)

I then checked group 1: if it was empty, I treated the result as a match. But this fails for http://abc.zzz/pqr/pqr/

How can I detect that the text to match is not part of a URL?

The worst case, I think, is to detect all the URLs first and store the start and end indexes of each match, then match pqr and exclude every match that falls inside a URL. I was wondering whether there is a better way.
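For illustration, here is a minimal Java sketch of that two-pass fallback. It assumes a deliberately loose URL pattern (a known scheme or a leading www. followed by non-whitespace), and the names UrlSpanFilter and findOutsideUrls are made up for this example. The target is passed in as a Pattern, since pqr may itself be a regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlSpanFilter {
    // Assumption: a URL is a known scheme or "www." followed by non-whitespace.
    private static final Pattern URL = Pattern.compile(
            "(?:(?:https?|ftp|file)://|mailto:|\\bwww\\.)\\S*");

    /** Returns the start offsets of target matches that do not overlap any URL. */
    static List<Integer> findOutsideUrls(String text, Pattern target) {
        // Pass 1: record [start, end) of every URL in the text.
        List<int[]> urlSpans = new ArrayList<>();
        Matcher url = URL.matcher(text);
        while (url.find()) {
            urlSpans.add(new int[] {url.start(), url.end()});
        }
        // Pass 2: keep only target matches that overlap no URL span.
        List<Integer> hits = new ArrayList<>();
        Matcher m = target.matcher(text);
        while (m.find()) {
            boolean insideUrl = false;
            for (int[] span : urlSpans) {
                if (m.start() < span[1] && m.end() > span[0]) {
                    insideUrl = true;
                    break;
                }
            }
            if (!insideUrl) {
                hits.add(m.start());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Pattern target = Pattern.compile("pqr");
        System.out.println(findOutsideUrls("http://abc.zzz/pqr/pqr/ pqr", target));
        System.out.println(findOutsideUrls("www.pqr.zzz", target));
    }
}
```

With the sample inputs from the question, only the pqr outside a URL should be reported, e.g. the final pqr in the fifth test string.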

What do you need to match `pqr` for? Replace with something? The best way is to match the URL, and then the `pqr` (in an alternation group). — Wiktor Stribiżew, May 05 '16 at 12:50
In my case `pqr` itself is a regular expression which might not always be a part of a url. — pratZ, May 05 '16 at 12:52
The worst case I think is to detect all the urls first and then store the start and end indexes of the matched urls. Then try to match `pqr` and exclude all those which are part of the url. — pratZ, May 05 '16 at 12:55
Yes, that is why I say: match a URL in one branch and capture it, and then match the `pqr`. Something like `((?:https?|ftp|file|mailto)://\S*)|pqr` to replace with `$1` (if you want to remove `pqr`) or use a callback method to differentiate the actions. — Wiktor Stribiżew, May 05 '16 at 12:57
Thanks for that. I was just hoping if there is something that can be done better here. I will go for the same solution if thats the only way to do it. — pratZ, May 05 '16 at 13:02
Well, if you are using .NET regex, you can use [`(?<!(?:https?|ftp|file|mailto)://(?:www\.)?\S*)(?<!\bwww\.)pqr`](http://regexstorm.net/tester?p=(%3f%3c!(%3f%3ahttps%3f%7cftp%7cfile%7cmailto)%3a%2f%2f(%3f%3awww%5c.)%3f%5cS*)(%3f%3c!%5cbwww%5c.)pqr&i=http%3a%2f%2fabc.zzz%2fpqr%2fxyz%0d%0a+-%3e+Should+not+match%0d%0a%2fpqr%2f%0d%0a+-%3e+Should+Match%0d%0apqr%0d%0a+-%3e+Should+Match%0d%0ahttp%3a%2f%2fabc.zzz%2fpqr%2fpqr%2f%0d%0a+-%3e+Should+not+match%0d%0ahttp%3a%2f%2fabc.zzz%2fpqr%2fpqr%2f+pqr%0d%0a+-%3e+Should+match+the+last+%22pqr%22%0d%0awww.pqr.zzz%0d%0a+-%3e+Should+not+match). — Wiktor Stribiżew, May 05 '16 at 13:03
@WiktorStribiżew I'm actually using Java. But thanks a lot, I never knew that we can use quantifiers in lookbehinds in Java regex. You can post your solution as an answer. Just a minor correction: there is no `//` in the `mailto:` URL scheme. — pratZ, May 05 '16 at 13:47
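
The alternation idea from the comments (match a URL in one branch and capture it, match pqr in the other) could look roughly like this in Java. This is only a sketch: the URL branch is adapted from the commented pattern, with mailto: written without // and an extra www. branch added as assumptions, and the class and method names are hypothetical. A loop with appendReplacement plays the role of the callback.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlAlternation {
    // Group 1 captures a whole URL; the second branch is the text to find.
    // Assumption: URLs start with a known scheme or "www." and end at whitespace.
    private static final Pattern URL_OR_TARGET = Pattern.compile(
            "((?:(?:https?|ftp|file)://|mailto:|\\bwww\\.)\\S*)|pqr");

    /** Replaces every pqr that is not part of a URL; URLs are copied unchanged. */
    static String replaceOutsideUrls(String text, String replacement) {
        Matcher m = URL_OR_TARGET.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // If group 1 participated, the match is a URL: put it back as-is.
            String r = (m.group(1) != null) ? m.group(1) : replacement;
            m.appendReplacement(out, Matcher.quoteReplacement(r));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceOutsideUrls("http://abc.zzz/pqr/pqr/ pqr", "X"));
        System.out.println(replaceOutsideUrls("/pqr/", "X"));
    }
}
```

Because the URL branch consumes the whole URL in one match, any pqr inside it is never seen by the second branch, which is what avoids the http://abc.zzz/pqr/pqr/ failure described in the question.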

score 2 · Accepted Answer · answered May 05 '16 at 14:01

Since you are using Java, you can leverage the constrained-width lookbehind that the Java regex engine supports. This means you can use the {n,m} limiting quantifier inside a lookbehind. Right now, Java 8 even accepts the * and + quantifiers inside a lookbehind (although unofficially), but this is a bug and is likely to be fixed in the next version. Thus, you may use some range, say 0 to 1000 (a link is unlikely to contain more than 1K characters, but you may adjust the bound to your actual data):

 (?<!(?:(?:https?|ftp|file)://|mailto:)(?:www\.)?\S{0,1000})(?<!\bwww\.\S{0,1000})pqr

See the regex demo

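The first lookbehind, (?<!(?:(?:https?|ftp|file)://|mailto:)(?:www\.)?\S{0,1000}), checks that pqr is not preceded by a URL scheme followed by up to 1000 non-whitespace characters, i.e. that it is not inside a full URL. The second lookbehind, (?<!\bwww\.\S{0,1000}), checks that pqr is not preceded by www. followed by up to 1000 non-whitespace characters.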
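
As a rough usage sketch, the pattern above can be compiled in Java as follows. Note that every backslash is doubled inside the string literal. The class name LookbehindSearch and the find helper are made up for this example, and pqr is hard-coded; a different target would be substituted at the end of the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindSearch {
    // Same regex as above; {0,1000} keeps the lookbehind width bounded.
    private static final Pattern NOT_IN_URL = Pattern.compile(
            "(?<!(?:(?:https?|ftp|file)://|mailto:)(?:www\\.)?\\S{0,1000})"
            + "(?<!\\bwww\\.\\S{0,1000})pqr");

    /** Returns the start offsets of every pqr that is not part of a URL. */
    static List<Integer> find(String text) {
        List<Integer> hits = new ArrayList<>();
        Matcher m = NOT_IN_URL.matcher(text);
        while (m.find()) {
            hits.add(m.start());
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://abc.zzz/pqr/xyz",
            "/pqr/",
            "pqr",
            "http://abc.zzz/pqr/pqr/",
            "http://abc.zzz/pqr/pqr/ pqr",
            "www.pqr.zzz"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + find(s));
        }
    }
}
```

The samples are the six test strings from the question; an empty list means no match outside a URL was found.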
