Regex to match URLs in a text which are not part of an html tag

Question

I am using the following regex to replace plain URLs with html links in a text:

preg_replace('/(http[s]{0,1}\:\/\/\S{4,})\s{0,}/ims', '<a href="$1" target="_blank">$1</a> ', $text_msg);

Now I want to modify the regex so that it replaces a URL only if there is no double quote immediately before it, meaning the URL is not part of a tag (i.e. the URL is at the start of the string, at the start of a line, or after a space).

Examples:

This is the link <a href="http://test.com"> ... (URL should not be replaced)
http://test.com (at the beginning of a line or of the whole multi-line string; URL should be replaced)
This is the site: http://test.com (URL should be replaced)

Thanks.

Also, don't provide irrelevant code. The code in question has nothing to do with your current problem. You're merely showing us code that solved a previous problem you had. Instead, show us the code you tried for your current problem and tell us how it didn't do what you wanted it to do. — Sherif, May 02 '20 at 08:14
To simplify, your actual problem here is separating the text from the HTML, not the parsing of the URL (you already got that part covered). To do the former, simply use something like [`DOMDocument`](http://php.net/domdocument), which is an HTML parser capable of extracting the text nodes from the DOM, and run your regex on that text instead. — Sherif, May 02 '20 at 08:18
@sherif the question you suggested is similar but does not answer my question. I need to replace if and only if there is no double quote behind the URL. The code is relevant because I want it to be changed. — wmac, May 02 '20 at 08:51
No, you don't want to replace it only if there are no double quotes behind the URL. What you want is to replace plain text URLs inside of the DOM with HTML links. See my answer below for full details. — Sherif, May 02 '20 at 08:53

score 0 · Answer 1 · answered May 02 '20 at 08:50

Your question actually breaks down into two smaller problems. You've already solved one of them, which is parsing the URL with a regular expression. The second part is extracting text from HTML, which isn't easily solved by a regular expression at all. The confusion comes from trying to do both at the same time with a single regular expression (parsing HTML and parsing the URL). See the Stack Overflow answer on parsing HTML with regex for more details on why this is a bad idea.
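To see why the regex-only route is fragile, here is a minimal sketch of the literal request using a negative lookbehind. The function name `linkifyPlain` and the characters listed in the lookbehind are assumptions for illustration, not part of the question's code.

```php
<?php
// Hypothetical helper: link URLs that are not directly preceded by ", ' or =.
function linkifyPlain(string $text): string
{
    // (?<!["'=]) rejects a URL that immediately follows a quote or an equals sign.
    $pattern = '~(?<!["\'=])(https?://\S{4,})~i';
    return preg_replace($pattern, '<a href="$1" target="_blank">$1</a>', $text);
}

// Plain URL after a space: gets linked.
echo linkifyPlain('This is the site: http://test.com'), "\n";
// URL inside an href attribute: left alone.
echo linkifyPlain('This is the link <a href="http://test.com">here</a>'), "\n";
// URL used as link text: the lookbehind does not help, and \S also swallows "</a>".
echo linkifyPlain('<a href="http://test.com">http://test.com</a>'), "\n";
```

The first two calls behave as requested. The third still rewrites the URL because it follows `>` rather than a quote, and the greedy `\S{4,}` pulls the closing `</a>` into the match, so the result is nested, malformed markup.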

So instead, let's just use an HTML parser (like DOMDocument) to extract text nodes from the HTML and parse URLs inside those text nodes.

Here's an example:

<?php
$html = <<<'HTML'
    <p>This is a URL http://abcd/ims in text</p>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Let's walk the entire DOM tree looking for text nodes
function walk(DOMNode $node, $skipParent = false) {
    if (!$skipParent) {
        yield $node;
    }
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $n) {
            yield from walk($n);
        }
    }
}

foreach (walk($dom->firstChild) as $node) {
    if ($node instanceof DOMText) {
        // Let's find any links and change them to HTML
        if (preg_match('/(http[s]{0,1}\:\/\/\S{4,})\s{0,}/ims', $node->nodeValue, $match)) {

            $node->nodeValue = preg_replace('/(http[s]{0,1}\:\/\/\S{4,})\s{0,}/ims', "\xff ",
                                            $node->nodeValue);
            $nodeSplit = explode("\xff", $node->nodeValue, 2);
            $node->nodeValue = $nodeSplit[1];
            $newNode = $dom->createTextNode($nodeSplit[0]);
            $href = $dom->createElement('a', $match[1]);
            $href->setAttribute('href', $match[1]);
            $node->parentNode->insertBefore($newNode, $node);
            $node->parentNode->insertBefore($href, $node);
        }
    }
}

echo $dom->saveHTML();

Which gives you the desired HTML as output:

<p>This is a URL <a href="http://abcd/ims">http://abcd/ims</a> in text</p>
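The example above appears to handle only the first URL in each text node, and it would also visit text that is already inside an `<a>` element. A hedged sketch of one way to cover both cases follows. The function name `linkifyHtml`, the wrapper `<div>`, and the tightened character class `[^\s<>"]` are assumptions of this sketch, and it assumes ASCII or otherwise correctly declared input encoding.

```php
<?php
// Hypothetical helper: links every URL in every text node,
// skipping text that already sits inside an <a> element.
function linkifyHtml(string $html): string
{
    $dom = new DOMDocument;
    // Wrap in a <div> so fragments with several top-level nodes parse as one tree.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $xpath = new DOMXPath($dom);
    // Snapshot the matching text nodes before the loop modifies the tree.
    $textNodes = iterator_to_array($xpath->query('//text()[not(ancestor::a)]'));

    foreach ($textNodes as $node) {
        // With DELIM_CAPTURE, odd indexes hold URLs and even indexes hold plain text.
        $parts = preg_split('~(https?://[^\s<>"]{4,})~i', $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no URL in this text node
        }
        $fragment = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $a = $dom->createElement('a');
                $a->setAttribute('href', $part);
                // A text node escapes characters such as & on output.
                $a->appendChild($dom->createTextNode($part));
                $fragment->appendChild($a);
            } elseif ($part !== '') {
                $fragment->appendChild($dom->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo linkifyHtml('<p>See http://a.example/x and <a href="http://b.example/y">http://b.example/y</a></p>');
```

Splitting with `PREG_SPLIT_DELIM_CAPTURE` keeps the text between URLs, so no placeholder byte such as `"\xff"` is needed, and the XPath filter `not(ancestor::a)` leaves existing links untouched.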
