Update function to recognize links

Question

I have this function to recognize and convert hashtags, emoji, etc.

function convert_text($str) {
        $regex = "/[@#](\w+)/";
    //type and links
        $hrefs = [
            '#' => 'hashtag.php?hashtag',
            '@' => 'user.php?user'
        ];

        $result = preg_replace_callback($regex, function($matches) use ($hrefs) {
             return sprintf(
                 '<a href="%s=%s">%s</a>',
                 $hrefs[$matches[0][0]],
                 $matches[1], 
                 $matches[0]
             );
        }, $str);

        //$result = preg_replace("/U\+([A-F0-9]{5})/", '\u{${1}}', $result);
        $result = preg_replace('/U\+([A-F0-9]{5})/', '<span style="font-size:30px;">&#x\\1;</span>', $result);

        return ($result);
    }

I would like to make it recognize http:// and https:// links in the text and then convert them to:

<a href="http://link.com">http://link.com</a>

How can I implement this inside the function?

What's the point in *converting* emojis? They're characters already. — AmigoJack, Aug 15 '19 at 05:50
The function already recognizes and converts emoji. I would like to make it recognize and convert `http` and `https` links. — Otávio Barreto, Aug 15 '19 at 12:11
I saw that page while I was researching this question. It does one thing that is required, but that technique needs to be properly integrated into the OP's pre-existing script to provide the OP's desired result. — mickmackusa, Aug 28 '19 at 13:16

Emma · Answer 1 · 2019-08-15T04:41:17.527

My guess is that you want an expression close to:

\bhttps?:\/\/\S*\b

Demo

Match

$re = '/\bhttps?:\/\/\S*\b/s';
$str = 'some text before http://some_domain.com/some_link some text before  https://www.some_domain.com/some_link some text after';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

var_dump($matches);

Output

array(2) {
  [0]=>
  array(1) {
    [0]=>
    string(32) "http://some_domain.com/some_link"
  }
  [1]=>
  array(1) {
    [0]=>
    string(37) "https://www.some_domain.com/some_link"
  }
}

Replace

$re = '/(\bhttps?:\/\/\S*\b)/s';
$str = 'some text before http://some_domain.com/some_link some text before  https://www.some_domain.com/some_link some text after';
$subst = '<a href="$1">$1</a>';

echo preg_replace($re, $subst, $str);

Output

some text before <a href="http://some_domain.com/some_link">http://some_domain.com/some_link</a> some text before  <a href="https://www.some_domain.com/some_link">https://www.some_domain.com/some_link</a> some text after

If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also watch in this link how it matches against some sample inputs.


Simple and beautiful. One problem, I imagine, could be that this has unintended consequences for URLs which are followed by punctuation, as the punctuation character would be treated as part of the URL. — tarleb, Aug 19 '19 at 18:26

mickmackusa · Accepted Answer · 2019-08-21T10:25:16.147

I won't go down the rabbit hole about constructing a world-conquering regex pattern to extract all valid urls the world can dream up including unicode while denying urls with valid characters but illogical structures. (I'll go with Gumbo and move on.)

For a regex demo see: https://regex101.com/r/HFCP1Z/1/

Things to note:

If a url is matched, there is no capture group, so $m[1] isn't generated. If a user/hash tag is matched, the fullstring match and capture group 1 are generated. If an emoji is matched, the fullstring match is populated, the capture group 1 element is empty (but declared because php generates $m as an indexed array -- no gaps), and capture group 2 holds the emoji's parenthetical substring.
You need to be sure that you don't accidentally replace part of a url which contains a qualifying hashtag/usertag substring. (Currently, the other answers don't consider this vulnerability.) I am going to prevent that scenario by performing a single pass over the input and consuming whole url substrings before the other patterns get a chance at it.
(notice: http://example.com/@dave and http://example.com?asdf=1234#anchor)
There are two reasons that I am declaring your hashtag/usertag lookup array as a constant.
1. It does not vary, so it needn't be a variable.
2. It enjoys global scope, so the use() syntax is not necessary inside of preg_replace_callback().
You should avoid adding inline styling to your tags. I recommend assigning a class so that you can simply update a single portion of the stylesheet when you decide to amend/extend the styling at a later time.
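
The first note, about how $m is populated, can be checked with a small sketch. It uses a simplified URL alternative (`\bhttps?://\S+`, an assumption made for brevity, not the pattern from this answer) and a hypothetical helper `match_shapes()` that records the `$m` array passed to the callback for each match:

```php
<?php
// Hypothetical helper: record the $m array that preg_replace_callback()
// passes to the callback for each match.
function match_shapes(string $str): array
{
    $shapes = [];
    preg_replace_callback(
        // Simplified URL branch (assumption), then pingtag, then emoji code.
        '~\bhttps?://\S+|[@#](\w+)|U\+([A-F\d]{5})~',
        function (array $m) use (&$shapes): string {
            $shapes[] = $m;
            return $m[0]; // leave the text unchanged
        },
        $str
    );
    return $shapes;
}

$shapes = match_shapes('http://example.com/@dave @ping U+1F600');
var_export($shapes);
```

With this input, the url match should yield a one-element array, the pingtag a two-element array, and the emoji code a three-element array whose index 1 is an empty string. That difference is what lets `isset($m[1])` and `isset($m[2])` tell the branches apart.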

Code: (Demo)

define('PINGTAGS', [
        '#' => 'hashtag.php?hashtag',
        '@' => 'user.php?user'
    ]);

function convert_text($str) {
    return preg_replace_callback(
        "~(?i)\bhttps?[-\w.\~:/?#[\]@!$&'()*+,;=]+|[@#](\w+)|U\+([A-F\d]{5})~",
        function($m) {
            // var_export($m);  // see for yourself
            if (!isset($m[1])) { // url
                return sprintf('<a href="%s">%s</a>', $m[0], $m[0]);
            }
            if (!isset($m[2])) { // pingtag
                return sprintf('<a href="%s=%s">%s</a>', PINGTAGS[$m[0][0]], $m[1], $m[0]);
            }
            return "<span class=\"emoji\">&#x{$m[2]};</span>"; // emoji
        },
        $str);
}

echo convert_text(
<<<STRING
This is a @ping and a #hash.
This is a www.example.com, this is http://example.com?asdf=1234#anchor
https://www.example.net/a/b/c/?g=5&awesome=foobar# U+23232 http://www5.example.com
https://sub.sub.www.example.org/ @pong@pug#tagged
http://example.com/@dave
more http://example.com/more_(than)_one_(parens)
andU+98765more http://example.com/blah_(wikipedia)#cite-1
and more http://example.com/blah_(wikipedia)_blah#cite-1
and more http://example.com/(something)?after=parens
STRING
);

Raw Output:

This is a <a href="user.php?user=ping">@ping</a> and a <a href="hashtag.php?hashtag=hash">#hash</a>.
This is a www.example.com, this is <a href="http://example.com?asdf=1234#anchor">http://example.com?asdf=1234#anchor</a>
<a href="https://www.example.net/a/b/c/?g=5&awesome=foobar#">https://www.example.net/a/b/c/?g=5&awesome=foobar#</a> <span class="emoji">&#x23232;</span> <a href="http://www5.example.com">http://www5.example.com</a>
<a href="https://sub.sub.www.example.org/">https://sub.sub.www.example.org/</a> <a href="user.php?user=pong">@pong</a><a href="user.php?user=pug">@pug</a><a href="hashtag.php?hashtag=tagged">#tagged</a>
<a href="http://example.com/@dave">http://example.com/@dave</a>
more <a href="http://example.com/more_(than)_one_(parens)">http://example.com/more_(than)_one_(parens)</a>
and<span class="emoji">&#x98765;</span>more <a href="http://example.com/blah_(wikipedia)#cite-1">http://example.com/blah_(wikipedia)#cite-1</a>
and more <a href="http://example.com/blah_(wikipedia)_blah#cite-1">http://example.com/blah_(wikipedia)_blah#cite-1</a>
and more <a href="http://example.com/(something)?after=parens">http://example.com/(something)?after=parens</a>

Stackoverflow-Rendered Output:

This is a @ping and a #hash. This is a www.example.com, this is http://example.com?asdf=1234#anchor https://www.example.net/a/b/c/?g=5&awesome=foobar# 𣈲 http://www5.example.com https://sub.sub.www.example.org/ @pong@pug#tagged http://example.com/@dave more http://example.com/more_(than)_one_(parens) and򘝥more http://example.com/blah_(wikipedia)#cite-1 and more http://example.com/blah_(wikipedia)_blah#cite-1 and more http://example.com/(something)?after=parens

p.s. The hash and user tags aren't highlighted here, but they are the local links that you asked for.

If you later decide to add padding or other styling declarations, a class will not continue to bloat your html markup. A simple `class` declaration keeps your content clean and easy to manage. There are more reasons. Have a read through this: https://stackoverflow.com/q/2612483/2943403 — mickmackusa, Aug 20 '19 at 22:07

score 1 · Answer 3 · answered Aug 20 '19 at 00:09

To recognize links I would try:

function convert_text($str){
   return preg_replace_callback('/\bhttps?:\/\/[A-Z0-9+&@#\/%?=~_|$!:,.;-]*[A-Z0-9+&@#\/%=~_|$]/i', 'compute_replacement', $str);
}

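function compute_replacement($groups) {
    // $groups[0] holds the full match; "$0" is not expanded in a callback's return value
    return '<a href="' . $groups[0] . '">' . $groups[0] . '</a>';
}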

Booboo · Answer 4 · 2019-08-26T13:09:25.827

My regex is not intended to validate a URL but to recognize one, on the assumption that the URL is valid. Let's not forget that a valid URL can have both a query string and a hashtag (fragment). The latter presents a problem, since the current convert_text function looks for hashtags under the assumption that they are not part of a URL. So my regex also assumes that URLs will not contain hashtags. I have added an additional call to preg_replace to the existing function as follows:

function convert_text($str) {
        $regex = "/[@#](\w+)/";
    //type and links
        $hrefs = [
            '#' => 'hashtag.php?hashtag',
            '@' => 'user.php?user'
        ];

        $result = preg_replace_callback($regex, function($matches) use ($hrefs) {
             return sprintf(
                 '<a href="%s=%s">%s</a>',
                 $hrefs[$matches[0][0]],
                 $matches[1],
                 $matches[0]
             );
        }, $str);

        //$result = preg_replace("/U\+([A-F0-9]{5})/", '\u{${1}}', $result);
        $result = preg_replace('/U\+([A-F0-9]{5})/', '<span style="font-size:30px;">&#x\\1;</span>', $result);

        // the addition:
        $result = preg_replace("~\bhttps?:/(/[^/\s]+)+/?(\?[^=\s]+=[^&\s]+(&(amp;)?[^=\s]+=[^&\s]+)*)?\b~", '<a href="$0">$0</a>', $result);

        return ($result);
}

Test:

echo convert_text('#abc http://example.com https://example.com/a/b?x=1&y=2');

Prints:

<a href="hashtag.php?hashtag=abc">#abc</a> <a href="http://example.com">http://example.com</a> <a href="https://example.com/a/b?x=1&y=2">https://example.com/a/b?x=1&y=2</a>
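
The assumption that URLs carry no hashtags matters because the tag pass runs before the URL pass. A minimal sketch of what happens otherwise, using two illustrative helpers that are not part of this answer (`tag_pass()` mimics only the hashtag branch of the original function; `single_pass()` uses a simplified URL pattern placed ahead of the hashtag alternative):

```php
<?php
// Hypothetical helper: hashtag pass only, as run first in the original function.
function tag_pass(string $str): string
{
    return preg_replace('/#(\w+)/', '<a href="hashtag.php?hashtag=$1">#$1</a>', $str);
}

// Hypothetical helper: one pass in which the URL alternative comes first,
// so a fragment such as #top is consumed as part of the URL.
function single_pass(string $str): string
{
    return preg_replace_callback(
        '~\bhttps?://\S+|#(\w+)~', // simplified URL pattern (assumption)
        function (array $m): string {
            if (!isset($m[1])) { // no capture group: URL branch matched
                return sprintf('<a href="%s">%s</a>', $m[0], $m[0]);
            }
            return sprintf('<a href="hashtag.php?hashtag=%s">%s</a>', $m[1], $m[0]);
        },
        $str
    );
}

$input = 'http://example.com/page#top';
echo tag_pass($input), "\n";    // the fragment is turned into a hashtag link
echo single_pass($input), "\n"; // the whole URL becomes one link
```

In the first call the `#top` fragment is rewritten before any URL pattern sees it, which is the case the regex above deliberately leaves out.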

Cool, I will test soon. — Otávio Barreto, Aug 26 '19 at 03:07

YOGO · Answer 5 · 2019-08-19T22:53:29.330

@tarleb No, the regex by @emma will not match that punctuation. Because of its trailing \b, it will not match anything other than [a-zA-Z0-9_] at the end of the URL.

RFC "Legal" characters are [%A-Za-z0-9\-_~:\/?#\]\[@!$&'()*+,;=]

So that regex will cut off URLs ending in any of %.-_~:/?#][@!$&'()*+,;= even though such URLs are valid too. If you want to match them, but not a URL ending with ., you should use:

(\bhttps?:\/\/\S{4,}(?:[-_~:\/?#\]\[@!$&'()*+,;=%]|\b))

You can also remove the , or any other character from the class to match as you please.
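
To compare the two patterns on a URL followed by sentence punctuation, here is a small sketch. The helper `first_match()` and the sample text are illustrative and not part of either answer:

```php
<?php
// Pattern from Emma's answer: the trailing \b forces the match to end on a word character.
$emma = '/\bhttps?:\/\/\S*\b/s';
// Pattern from this answer: the match may also end on one of the listed punctuation characters.
$yogo = '/(\bhttps?:\/\/\S{4,}(?:[-_~:\/?#\]\[@!$&\'()*+,;=%]|\b))/';

// Hypothetical helper: return the first match of $pattern in $text, or null if none.
function first_match(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

$text = 'Visit http://example.com/path/. Done';
echo first_match($emma, $text), "\n"; // http://example.com/path
echo first_match($yogo, $text), "\n"; // http://example.com/path/
```

Both patterns leave the sentence-ending dot out of the match, but only the second one keeps the trailing slash.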

function convert_text($str) {
    $regex = "/[@#](\w+)/";
//type and links
    $hrefs = [
        '#' => 'hashtag.php?hashtag',
        '@' => 'user.php?user'
    ];

    $result = preg_replace_callback($regex, function($matches) use ($hrefs) {
         return sprintf(
             '<a href="%s=%s">%s</a>',
             $hrefs[$matches[0][0]],
             $matches[1], 
             $matches[0]
         );
    }, $str);

    $result = preg_replace('/(\bhttps?:\/\/\S{4,}(?:[-_~:\/?#\]\[@!$&\'()*+,;=%]|\b))/', '<a href="\1">\1</a>', $result);

    //$result = preg_replace("/U\+([A-F0-9]{5})/", '\u{${1}}', $result);
    $result = preg_replace('/U\+([A-F0-9]{5})/', '<span style="font-size:30px;">&#x\\1;</span>', $result);

    return ($result);
}

Demo

do you think I would have a `var` conflict because of using $result 2 times in the function? — Otávio Barreto, Aug 20 '19 at 00:05
This solution could be vulnerable to JavaScript code injection — YOGO, Oct 02 '19 at 14:33
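
One way to reduce that risk is to escape the matched URL before putting it into the markup. This is only a sketch under assumptions: `link_urls_escaped()` is a hypothetical helper, the URL pattern is simplified, and text outside the matched URLs is not escaped here, so it would still need its own handling.

```php
<?php
// Hypothetical helper: wrap URLs in links, escaping the matched text so that
// quotes or angle brackets inside it cannot break out of the href attribute.
function link_urls_escaped(string $str): string
{
    return preg_replace_callback(
        '~\bhttps?://\S+~i', // simplified URL pattern (assumption)
        function (array $m): string {
            $safe = htmlspecialchars($m[0], ENT_QUOTES, 'UTF-8');
            return sprintf('<a href="%s">%s</a>', $safe, $safe);
        },
        $str
    );
}

echo link_urls_escaped('http://example.com/"onmouseover="alert(1)'), "\n";
```

Because \S also matches quotes, an unescaped replacement would let the input above inject an attribute into the tag. With htmlspecialchars() the quotes are emitted as &quot; instead.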
