
I have this regex (from https://github.com/savetheinternet/Tinyboard/blob/master/inc/functions.php#L1620):

((?:https?:\/\/|ftp:\/\/|irc:\/\/)[^\s<>()"]+?(?:\([^\s<>()"]*?\)[^\s<>()"]*?)*)((?:\s|<|>|"|\.|\]|!|\?|,|&#44;|&quot;)*(?:[\s<>()"]|$))

It works for matching links like http://stackoverflow.com/.

The question is: how can I exclude matches inside this kind of markup (mainly the url and img parts)?

[url]http://stackoverflow.com/[/url]
[url=http://stackoverflow.com/]http://stackoverflow.com/[/url]
[img]http://cdn.sstatic.net/stackoverflow/img/sprites.png[/img]
[img=http://cdn.sstatic.net/stackoverflow/img/sprites.png]
Tunaki

2 Answers


To exclude these, you can add this subpattern at the beginning of your expression:

(?:\[(url|img)](?>[^[]++|\[(?!\/\g{-1}))*+\[\/\g{-1}]|\[(?:url|img)=[^]]*+])(*SKIP)(*FAIL)|your pattern here

The goal is to first match the parts you don't want, then force the regex engine to fail with the backtracking control verb (*FAIL). The (*SKIP) verb, placed just before it, ensures that when the subpattern then fails, the regex engine does not retry at any position inside the substring it has already matched.

You can find more information about these features here.
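To see the effect of the two verbs in isolation, here is a minimal sketch. It assumes PHP and uses a deliberately simplified link pattern (`$link` is a stand-in for illustration, not the Tinyboard regex), and it only covers the [url]...[/url] and [img]...[/img] form:

```php
<?php
// Simplified stand-in for the real link pattern (assumption, for brevity only).
$link = 'https?://[^\s\[\]]+';

// Branch that consumes a whole [url]...[/url] or [img]...[/img] block,
// then fails on purpose and tells the engine to resume after the block.
$skip = '\[(url|img)](?>[^[]++|\[(?!/\g{-1}))*+\[/\g{-1}](*SKIP)(*FAIL)';

$text = 'a [url]http://in.example/[/url] b http://out.example/x';

// Without the skipping branch, the URL inside the tags is matched as well.
preg_match_all('~' . $link . '~', $text, $plain);

// With it, only the bare URL remains.
preg_match_all('~' . $skip . '|' . $link . '~', $text, $filtered);

print_r($plain[0]);
print_r($filtered[0]);
```

The first call reports both URLs, while the second reports only the one outside the tags.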

Note: assuming you are using PHP, you can make this very long pattern more manageable by replacing the usual / delimiter with ~, so that you no longer have to escape every / in the pattern, and by using verbose mode (the x modifier) with Nowdoc syntax. That way you can comment the pattern, which makes it more readable and easier to improve.

Example:

$pattern = <<<'EOF'
~
### skipping url and img bbcodes ###
(?:
    \[(url|img)]               # opening bbcode tag
    (?>[^[]++|\[(?!/\g{-1}))*+ # possible content between tags
    \[/\g{-1}]                 # closing bbcode tag
  |
    \[(?:url|img)= [^]]*+ ]   # self closing bbcode tags
)(*SKIP)(*FAIL)            # forces to fail and skip

| # OR

### a link ###
(
    (?:https?|ftp|irc)://      # protocol
    [^\s<>()"]+?
    (?:
        \( [^\s<>()"]*? \)     # part between parentheses
        [^\s<>()"]*?
    )*
)
(
    (?:[]\s<>".!?,]|&\#44;|&quot;)*   # trailing punctuation (escaped \# since a bare one starts a comment in x mode)
    (?:[\s<>()"]|$)
)
~x
EOF;
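As a rough usage sketch, the pattern can be handed to preg_replace_callback to wrap each bare link in an anchor tag. The helper name linkify() is hypothetical, and the input is assumed to be HTML-escaped already (the pattern itself expects entities such as &quot;). The pattern is repeated without comments so that the snippet runs on its own. Note that in x mode an unescaped # outside a character class starts a comment, so the entity is written &\#44; here. With the extra capture group of the skipping branch, the link ends up in group 2 and the trailing characters in group 3.

```php
<?php
$pattern = <<<'EOF'
~
(?:
    \[(url|img)]
    (?>[^[]++|\[(?!/\g{-1}))*+
    \[/\g{-1}]
  |
    \[(?:url|img)= [^]]*+ ]
)(*SKIP)(*FAIL)
|
(
    (?:https?|ftp|irc)://
    [^\s<>()"]+?
    (?: \( [^\s<>()"]*? \) [^\s<>()"]*? )*
)
(
    (?:[]\s<>".!?,]|&\#44;|&quot;)*
    (?:[\s<>()"]|$)
)
~x
EOF;

// Hypothetical helper: turns bare links into anchors and leaves bbcode untouched.
function linkify(string $text, string $pattern): string
{
    return preg_replace_callback($pattern, function (array $m): string {
        $url  = $m[2];        // group 2: the link itself
        $tail = $m[3] ?? '';  // group 3: trailing punctuation or whitespace
        return '<a href="' . $url . '">' . $url . '</a>' . $tail;
    }, $text);
}

echo linkify('see http://example.com/a, ok', $pattern), "\n";
echo linkify('[img=http://x.example/i.png]', $pattern), "\n";
```

The trailing comma and space in the first sample stay outside the anchor because they belong to the second part of the pattern, not to the link.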
Casimir et Hippolyte

You could solve it with a negative lookbehind assertion:

(?<!pattern)

In your case, you can check that there is no ] or = character immediately before the link. The regex below ensures exactly that:

(?<!(?:\=|\]))((?:https?:\/\/|ftp:\/\/|irc:\/\/)[^\s<>()"]+?(?:\([^\s<>()"]*?\)[^\s<>()"]*?)*)((?:\s|<|>|"|\.|\]|!|\?|,|&#44;|&quot;)*(?:[\s<>()"]|$))

Note that the only part added is (?<!(?:\=|\])) at the very beginning, and that it will also refuse to match a link in something like <a href=http://example.com>. Your question does not say whether that matters, so either clarify the question if such links should match, or adapt the negative lookbehind yourself.
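A small sketch of how this behaves, assuming PHP with / as the delimiter. The pattern is the one above, unchanged, and findLinks() is a hypothetical helper that returns the first capture group (the link) of every match:

```php
<?php
$pattern = '/(?<!(?:\=|\]))((?:https?:\/\/|ftp:\/\/|irc:\/\/)[^\s<>()"]+?(?:\([^\s<>()"]*?\)[^\s<>()"]*?)*)((?:\s|<|>|"|\.|\]|!|\?|,|&#44;|&quot;)*(?:[\s<>()"]|$))/';

// Hypothetical helper: collects capture group 1 (the link) of every match.
function findLinks(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $m);
    return $m[1];
}

$samples = [
    'plain http://example.com/ here',   // bare link: matched
    '[url]http://example.com/[/url]',   // preceded by ]: rejected
    '[img=http://example.com/i.png]',   // preceded by =: rejected
    '<a href=http://example.com>',      // also preceded by =: rejected
];

foreach ($samples as $sample) {
    echo $sample, ' => ', implode(', ', findLinks($pattern, $sample)) ?: '(no match)', "\n";
}
```

Only the first sample yields a link. The last one shows the side effect mentioned above: any link directly preceded by = is rejected, whether or not it is bbcode.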

Michal Gasek