
I have this:

[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&\/\/=]*)

It matches:

www.example.com 
http://example.com.nz 
example.com
http://www.example.com?2rjl6
example.com/first/second
https://example.us.edi?34535/534534?dfg=g&fg

etc...

I want no match if any of the above URLs are enclosed in square brackets [ ] like this:

[www.example.com]
[http://example.com.nz]
etc...

The text is long and may or may not contain more than one URL, spaces, line breaks, and so on.

e.g.

Lorem ipsum dolor sit amet, consectetur [http://example.com.nz] llamcorper et lacus. Morbi sodales convallis lectus a efficitur: example.com/first/second vitae nisl placerat.

Fusce non ipsum a augue http://example.com.nz http://www.example.com?2rjl6 aculis augue. Nullam eu nulla lectus.

In this case there should be only 3 matches.

I tried adding:

(?![^\[]*\])

But it doesn't work as expected.

Can you help me with this or recommend another approach? Thanks.

Dr_Kramer

1 Answer


You can match everything from an opening square bracket to the closing one, and then discard that match with (*SKIP)(*F), which PHP supports through PCRE.

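You might also shorten the pattern a bit. The whole first part is inside a character class, [(http(s)?):\/\/(www\.)?a-zA-Z0-9@:%._\+~#=], so the protocol and www. are read as a set of single characters instead of optional groups. Move the opening square bracket to just before a-z so that the optional parts come before the character class.

And since a-zA-Z0-9_ can be written as \w, the character class can be shortened to start with [\w

If you choose a delimiter other than /, such as ~, you don't have to escape the forward slashes. Keep in mind that the delimiter itself must then be escaped wherever it occurs inside the pattern, and this pattern contains a ~.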

\[[^][]*](*SKIP)(*F)|(?:https?://)?(?:www\.)?[\w@:%.+~#=]{2,256}\.[a-z]{2,6}\b[\w@:%+.~#?&/=-]*

Explanation

  • \[[^][]*] Match from an opening [ up to the closing ]
  • (*SKIP)(*F) Discard that match and continue searching after it
  • | Or
  • (?:https?://)? Optionally match the protocol
  • (?:www\.)? Optionally match www.
  • [\w@:%.+~#=]{2,256} Repeat 2-256 times any of the characters listed in the character class
  • \.[a-z]{2,6}\b Match a dot and 2-6 chars a-z followed by a word boundary
  • [\w@:%+.~#?&/=-]* Optionally match any of the characters listed in the character class (the - is placed last so it is read as a literal hyphen and not as a range)
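
If the regex engine at hand has no (*SKIP)(*F), the same idea can be approximated with a plain alternation. The following is only a minimal sketch in Python, whose built-in re module does not support these verbs: the bracketed block is matched by the left alternative without being captured, and only matches that fill the capture group are kept. The names find_urls and SAMPLE are made up for this illustration, re.ASCII is an assumption to keep \w limited to a-zA-Z0-9_, and the hyphen is written last in the final character class so it cannot be read as a range.

```python
import re

# Left alternative: swallow a whole [...] block without capturing it.
# Right alternative: the URL pattern from the answer, wrapped in group 1.
PATTERN = re.compile(
    r"\[[^\]\[]*\]"
    r"|"
    r"((?:https?://)?(?:www\.)?[\w@:%.+~#=]{2,256}\.[a-z]{2,6}\b[\w@:%+.~#?&/=-]*)",
    re.ASCII,  # assumption: keep \w ASCII-only
)


def find_urls(text):
    """Return the URLs in text that are not enclosed in square brackets."""
    # A bracketed block still produces a match, but group 1 stays None for it.
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1) is not None]


# Sample text taken from the question.
SAMPLE = (
    "Lorem ipsum dolor sit amet, consectetur [http://example.com.nz] llamcorper et lacus. "
    "Morbi sodales convallis lectus a efficitur: example.com/first/second vitae nisl placerat.\n\n"
    "Fusce non ipsum a augue http://example.com.nz http://www.example.com?2rjl6 aculis augue. "
    "Nullam eu nulla lectus."
)

print(find_urls(SAMPLE))
# ['example.com/first/second', 'http://example.com.nz', 'http://www.example.com?2rjl6']
```

Because the bracketed alternative is tried first and consumes the whole [...] block, the URL alternative never gets a chance to start inside it, which is roughly what (*SKIP)(*F) achieves in PCRE.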


The fourth bird
  • Works like a charm, thanks. Just one thing: in PCRE it works without touching anything, but in PCRE2 I have to escape `-` in `[\w-@:%+.~#?&/=]*`, otherwise it gives me an error regardless of the delimiter. – Dr_Kramer Aug 08 '22 at 21:45
  • @Dr_Kramer You are correct about that, the `-` should be at the start or at the end of the character class in that case. – The fourth bird Aug 08 '22 at 21:47