
I'm using phpBB3 to make a message board. There is a built-in feature that takes all URLs in posts and renders them as links. I want to make it so that ONLY local links are made clickable.

phpBB3 runs regexes over the text of a post and changes each match into a link:

if ($somestuff){
// matches a xxxx://aaaaa.bbb.cccc. ...
$magic_url_match[] = '#(^|[\n\t (>.])(' . "[a-z]$scheme*:/{2}(?:(?:[a-z0-9\-._~!$&'($inline*+,;=:@|]+|%[\dA-F]{2})+|[0-9.]+|\[[a-z0-9.]+:[a-z0-9.]+:[a-z0-9.:]+\])(?::\d*)?(?:/(?:[a-z0-9\-._~!$&'($inline*+,;=:@|]+|%[\dA-F]{2})*)*(?:\?(?:[a-z0-9\-._~!$&'($inline*+,;=:@/?|]+|%[\dA-F]{2})*)?(?:\#(?:[a-z0-9\-._~!$&'($inline*+,;=:@/?|]+|%[\dA-F]{2})*)?" . ')#ie';
$magic_url_replace[] = "make_clickable_callback(MAGIC_URL_FULL, '\$1', '\$2', '', '$class')";

// matches a "www.xxxx.yyyy[/zzzz]" kinda lazy URL thing
$magic_url_match[] = '#(^|[\n\t (>])(' . "www\.(?:[a-z0-9\-._~!$&'($inline*+,;=:@|]+|%[\dA-F]{2})+(?::\d*)?(?:/(?:[a-z0-9\-._~!$&'($inline*+,;=:@|]+|%[\dA-F]{2})*)*(?:\?(?:[a-z0-9\-._~!$&'($inline*+,;=:@/?|]+|%[\dA-F]{2})*)?(?:\#(?:[a-z0-9\-._~!$&'($inline*+,;=:@/?|]+|%[\dA-F]{2})*)?" . ')#ie';
$magic_url_replace[] = "make_clickable_callback(MAGIC_URL_WWW, '\$1', '\$2', '', '$class')";
}
return preg_replace($magic_url_match, $magic_url_replace, $text);

How can I rewrite these regexes so that they only match links on my domain? Additionally, what is the best way to teach myself regex?

SomeKittens
pg.
  • Holy Molly!!! That's what I call **REGEX** - [THIS](http://i0.kym-cdn.com/photos/images/original/000/199/693/disgusted-mother-of-god.png?1321272571) – Zoltan Toth Aug 16 '12 at 22:34
  • I didn't say I liked the REGEX! – pg. Aug 16 '12 at 23:04
  • If I even could figure out which part of the regex is for the http://, which part is the www and which part is the domain name, I could do it I think. – pg. Aug 16 '12 at 23:18
  • See also [Open source RegexBuddy alternatives](http://stackoverflow.com/questions/89718/is-there) and [Online regex testing](http://stackoverflow.com/questions/32282/regex-testing) for some helpful tools, or [RegExp.info](http://regular-expressions.info/) for a nicer tutorial. – mario Aug 16 '12 at 23:48
  • I'm working my way through the first one now. Can I just comment that someone is using `/{2}` instead of `//` for the double slash after the protocol? Are they insane? – KRyan Aug 17 '12 at 03:21

1 Answer

This is the first one, broken up section by section. Even doing this was non-trivial...

(
    ^
|
    [\n\t (>.]
)

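OK, here we simply have "beginning of the text, or after a newline, tab, space, opening parenthesis, greater-than sign, or period". Just anchoring the regex. (There is no m flag, so ^ matches only at the very start of the text; later lines are covered by the \n in the character class.)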
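To see the anchor group on its own, here is a small sketch. The token `URL` is only a stand-in for the long URL sub-pattern, and `anchor_before` is a hypothetical helper, not part of phpBB.

```php
<?php
// 'URL' is a placeholder for the long URL sub-pattern, used only for this demo.
$anchorDemo = '#(^|[\n\t (>.])(URL)#i';

// Returns what the first group captured, or null when the pattern does not match.
function anchor_before(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) === 1 ? $m[1] : null;
}

var_dump(anchor_before($anchorDemo, 'URL'));        // matches at the start of the text
var_dump(anchor_before($anchorDemo, '(URL)'));      // matches after an opening parenthesis
var_dump(anchor_before($anchorDemo, "line\nURL"));  // matches after a newline
var_dump(anchor_before($anchorDemo, 'xURL'));       // no match: 'x' is not an allowed lead-in
```

The captured lead-in character is what gets passed on as '\$1' in the replacement, so it can be put back in front of the generated link.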

(
    [a-z]$scheme*:/{2}

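This is pure insanity right here. This part matches the scheme, i.e. the http://. Note that $scheme cannot simply hold the literal http: in [a-z]http* the * would apply only to the final p, so http:// itself would never match. It presumably holds a character class, so that [a-z]$scheme* means "a letter followed by any number of scheme characters". Why someone would use /{2} instead of //, I cannot begin to guess.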
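A quick check of this fragment, as a sketch. The value given to `$scheme` below is an assumption (its real value is set elsewhere in phpBB and is not shown in the question), and `starts_with_scheme` is a hypothetical helper.

```php
<?php
// Assumption: $scheme holds a character class along these lines.
$scheme = '[a-z\d+\-.]';

$withQuantifier = '#^[a-z]' . $scheme . '*:/{2}#i'; // as written in phpBB
$withLiteral    = '#^[a-z]' . $scheme . '*://#i';   // the same thing with a plain //
$literalHttp    = '#^[a-z]http*://#i';              // what you would get if $scheme were "http"

function starts_with_scheme(string $pattern, string $url): bool
{
    return preg_match($pattern, $url) === 1;
}

foreach (['http://example.com', 'svn+ssh://host', 'http:/example.com'] as $url) {
    // The first two columns always agree: /{2} and // are equivalent.
    var_dump(
        starts_with_scheme($withQuantifier, $url),
        starts_with_scheme($withLiteral, $url),
        starts_with_scheme($literalHttp, $url)
    );
}
```

Because the pattern delimiter is `#` rather than `/`, the slashes need no escaping either way.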

    (?:
        (?:
            [a-z0-9\-._~!$&'($inline*+,;=:@|]+
        |
            %[\dA-F]{2}
        )+
    |

This matches a series of characters, presumably those that are legal in a URL. Of note is the $inline PHP variable, which sits inside the character class and so presumably just adds to the set of allowed characters, and the second alternative, %[\dA-F]{2}. That matches percent-encoded characters such as %20 for a space. A bare % sign is not otherwise legal in the match (or in a URL).

Also important here is that / is not legal. This, therefore, cannot refer to directories, only to the domain. This is most likely the part you want to change, to simply match the appropriate domain of your website.
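The point about / can be checked directly. This is a sketch only: `host_part` is a hypothetical helper, and `$inline` is assumed to be empty because its real value is not shown in the question.

```php
<?php
// Returns the leading run of "host" characters from $rest, or null if there is none.
// Assumption: $inline defaults to an empty string.
function host_part(string $rest, string $inline = ''): ?string
{
    $host = '(?:[a-z0-9\-._~!$&\'(' . $inline . '*+,;=:@|]+|%[\dA-F]{2})+';
    return preg_match('#^' . $host . '#i', $rest, $m) === 1 ? $m[0] : null;
}

var_dump(host_part('www.example.com/forum/index.php')); // stops at the first slash
var_dump(host_part('my%20host/path'));                   // %20 is accepted via the second alternative
var_dump(host_part('/path'));                            // nothing to match before the slash
```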

For completeness's sake, though, here's the rest.

        [0-9.]+
    |

Alternatively, we could have a series of digits and periods – an IP address. Considering how complicated this regex is, I'm surprised he didn't go for (?:\d{1,3}\.){3}\d{1,3}...
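For comparison, here is a short sketch of the loose pattern next to the stricter one mentioned above. Both are anchored with ^ and $ purely for the demo, and the helper name `looks_like_ip` is made up.

```php
<?php
$loose  = '#^[0-9.]+$#';                   // what the phpBB pattern uses
$strict = '#^(?:\d{1,3}\.){3}\d{1,3}$#';   // four groups of one to three digits

function looks_like_ip(string $pattern, string $candidate): bool
{
    return preg_match($pattern, $candidate) === 1;
}

foreach (['192.168.0.1', '1.2', '....', '999.999.999.999'] as $candidate) {
    var_dump($candidate, looks_like_ip($loose, $candidate), looks_like_ip($strict, $candidate));
}
```

The stricter form rejects things like `1.2` and `....`, but it still accepts out-of-range values such as `999.999.999.999`, so neither pattern is a real validator.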

        \[
        [a-z0-9.]+
        :
        [a-z0-9.]+
        :
        [a-z0-9.:]+
        \]
    )

Here's our last alternative; I think this is for IPv6. It's a series of alphanumeric groups separated by colons, anyway. It requires that these be within square brackets, which is how IPv6 literals are written in URLs, though it does look odd in forum software that uses brackets so heavily for tags...

    (?:
        :
        \d*
    )?

Here, we get the option of some digits following a colon. That is, this is for URLs that have a port in them.

    (?:
        /
        (?:
            [a-z0-9\-._~!$&'($inline*+,;=:@|]+
        |
            %[\dA-F]{2}
        )*
    )*

OK, here we've gotten to the path (directories and file name), as shown by the / at the beginning of each segment. Otherwise, this is the same "legal URL characters" match.

    (?:
        \?
        (?:
            [a-z0-9\-._~!$&'($inline*+,;=:@/?|]+
        |
            %[\dA-F]{2}
        )*
    )?
    (?:
        \#
        (?:
            [a-z0-9\-._~!$&'($inline*+,;=:@/?|]+
        |
            %[\dA-F]{2}
        )*
    )?
)

Finally, the query string (the things being passed by GET), introduced by the \?, and the fragment that links to a mid-page anchor, introduced by the \#.
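If it helps to see how the pieces of the regex line up with the parts of a URL, PHP's built-in parse_url splits a URL along the same boundaries. The URL below is made up for illustration.

```php
<?php
// Hypothetical URL, used only to label the parts the regex walks through.
$parts = parse_url('http://www.example.com:8080/forum/viewtopic.php?t=1#p5');

print_r($parts);
// scheme   -> the [a-z]$scheme* part
// host     -> the big three-way alternation (name, IPv4, bracketed IPv6)
// port     -> (?::\d*)?
// path     -> the repeated (?:/...)* group
// query    -> the \? group
// fragment -> the \# group
```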

Bottom line:

This section:

    [a-z]$scheme*:/{2}
    (?:
        (?:
            [a-z0-9\-._~!$&'($inline*+,;=:@|]+
        |
            %[\dA-F]{2}
        )+
    |
        [0-9.]+
    |
        \[
        [a-z0-9.]+
        :
        [a-z0-9.]+
        :
        [a-z0-9.:]+
        \]
    )

Should be replaced with something like this:

    [a-z]$scheme*://
    www\.example\.com

Or maybe

    [a-z]$scheme*://
    (?:
        www\.example\.com
    |
        192\.168\.0\.1
    |
        \[::ffff:192\.168\.0\.1\]
    )

Where the domain and the IP addresses match your website. Obviously, you're going to have to remove the line breaks and indentation I added. I'd do it for you, but it's almost not worth it, because you'd have a hard time finding the spot to put your domain in the middle of all that.

You'll probably want to include some regex for subdomains or people leaving out the www. or what have you.
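Here is one way the whole thing might look once assembled, as a rough sketch rather than a drop-in phpBB patch. Everything named here is an assumption: `example.com` is a placeholder domain, `local_url_pattern` and `linkify_local` are hypothetical helpers, the scheme is simplified to `https?`, ports are not allowed, and `$inline` is treated as empty. It also uses preg_replace_callback instead of the e modifier, since that modifier was deprecated in PHP 5.5 and removed in PHP 7.

```php
<?php
// Builds a pattern that only matches http(s) URLs on $domain or its subdomains.
function local_url_pattern(string $domain): string
{
    $host = '(?:[a-z0-9\-]+\.)*' . preg_quote($domain, '#'); // optional www. or other subdomains
    $pct  = '%[\dA-F]{2}';
    $path = '[a-z0-9\-._~!$&\'()*+,;=:@|]';
    $tail = '[a-z0-9\-._~!$&\'()*+,;=:@/?|]';

    return '#(^|[\n\t (>.])('
        . 'https?://' . $host
        . '(?![a-z0-9\-@:]|\.[a-z0-9])'              // the host must really end here
        . '(?:/(?:' . $path . '+|' . $pct . ')*)*'   // path
        . '(?:\?(?:' . $tail . '+|' . $pct . ')*)?'  // query string
        . '(?:\#(?:' . $tail . '+|' . $pct . ')*)?'  // fragment
        . ')#i';
}

// Wraps local URLs in <a> tags and leaves everything else untouched.
function linkify_local(string $text, string $domain): string
{
    $result = preg_replace_callback(
        local_url_pattern($domain),
        function (array $m): string {
            $url = htmlspecialchars($m[2], ENT_QUOTES, 'UTF-8');
            return $m[1] . '<a href="' . $url . '">' . $url . '</a>';
        },
        $text
    );
    return $result ?? $text; // fall back to the input if PCRE reports an error
}

echo linkify_local('See http://www.example.com/forum/viewtopic.php?t=1#p5 now', 'example.com'), "\n";
echo linkify_local('But not http://evil.org/ or http://example.com.evil.org/x', 'example.com'), "\n";
```

The negative lookahead after the host is there so that look-alike hosts such as `example.com.evil.org` or `example.com@evil.org` are not treated as local. Inside phpBB you would keep calling make_clickable_callback rather than building the anchor tag by hand.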

You may also want to remove this:

    (?:
        :
        \d*
    )?

As you probably don't want people linking to other ports on your domain.

The second one looks to have roughly the same structure; as the comment says, it's just getting URLs that lack the protocol designator.

KRyan