Replacing text node of HTML input in PHP

Question

I want to replace all the text nodes in an HTML string. I'll explain with an example. Given this $html:

<div>
    <p>
        text2 text2 word text2
        <span>abcd</span>
        text2 text2 word text2
    </p>
    this is a long, very long statement with punctuations.
</div>

I want to replace "text2 text2 word text2" with "<span>text2 text2 word text2</span>" and "this is a long, very long statement with punctuations." with "<span>this is a long, very long statement with punctuations.</span>"

What regular expression would do this?

score 0 · Answer 1 · edited May 23 '17 at 11:47

This is normally where someone advises using an HTML parser. And indeed that would be more reliable for the task. Usually QueryPath or phpQuery are also easier on the eyes:

$p = phpQuery::newDocumentHTML($h);
$p->find("p")->not("span")->wrap("span");

But in this case I failed. It's a bit of a black art if you don't know all the magic jQuery selectors (and phpQuery doesn't have them all anyway). Your case is difficult since you want to work on individual text nodes. Hence you would actually have to use DOMDocument to walk the document node by node. It's certainly doable, but too much API overhead for me. :}

So I skipped right to the regex approach, which, with a lot of caution, is in fact workable:

 print preg_replace(
     '#(?<!<span)>(\s*)(\w[\w,.\h]+)(\s*)<#',
     '>$1<span>$2</span>$3<',
     $html);

The actual trick is the lookbehind assertion (?<!<span), so it won't match text that is already wrapped in spans. The pattern looks more confusing than it needs to because it captures the surrounding whitespace (\s) separately from the horizontal spaces (\h) inside the text, which keeps the indentation intact in the output. You'll have to adapt [\w,.\h] to include every other character your text can contain, such as the punctuation in the last line. This is where the regex approach shows its weakness: you cannot allow it to match < or >. And if your text strings are actually multi-line paragraphs, you'll have to undo the \s and \h separation.
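To make the character-class point concrete, here is a small sketch. It first runs the pattern on the question's sample (with the paragraph closed properly), then tries a widened class. The variant [^<>\v], the sample sentence and the variable names are my own assumptions for illustration, not something the question asked for. Note also that the lookbehind only recognises a bare <span> without attributes.

```php
<?php
// The question's sample, assuming the second <p> was meant to be </p>.
$html = <<<HTML
<div>
    <p>
        text2 text2 word text2
        <span>abcd</span>
        text2 text2 word text2
    </p>
    this is a long, very long statement with punctuations.
</div>
HTML;

// The pattern from the answer, unchanged.
$original    = '#(?<!<span)>(\s*)(\w[\w,.\h]+)(\s*)<#';
$replacement = '>$1<span>$2</span>$3<';

$wrapped = preg_replace($original, $replacement, $html);
echo $wrapped, "\n";

// Assumed variant: accept anything except angle brackets and
// vertical whitespace, instead of listing allowed characters.
$widenedPattern = '#(?<!<span)>(\s*)(\w[^<>\v]+)(\s*)<#';

// Hypothetical text with punctuation the original class does not list.
$sample = "<p>Hello, world! It's 5pm; ok?</p>";

// "!" is outside [\w,.\h], so the original pattern leaves this alone.
$unchanged = preg_replace($original, $replacement, $sample);
// The widened class lets the whole sentence match.
$widened = preg_replace($widenedPattern, $replacement, $sample);

echo $unchanged, "\n";
echo $widened, "\n";
```

The widened class still stops at line breaks, so the indentation handling stays the same, but it remains a regex: text containing a literal < or > or nested inline tags will not be handled correctly.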

So again, works for your simple case. But DOM approaches are usually more reliable.
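For comparison, a minimal sketch of what the DOMDocument route could look like. This is one possible way to do it, not a tested production solution: wrapTextNodes() and the wrap-root container id are hypothetical names I made up, and it assumes ASCII or entity-encoded input, since charset handling is left out.

```php
<?php
// wrapTextNodes() is a hypothetical helper name, not part of any library.
function wrapTextNodes(string $html): string
{
    $doc = new DOMDocument();
    // Silence parser warnings for fragment input, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    // The extra container div makes it easy to pull the fragment back out.
    $doc->loadHTML('<div id="wrap-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Text nodes that are not pure whitespace and not directly inside a <span>.
    $texts = $xpath->query('//text()[normalize-space(.) != ""][not(parent::span)]');

    foreach (iterator_to_array($texts) as $text) {
        // Split off leading and trailing whitespace so indentation survives.
        preg_match('/^(\s*)(.*?)(\s*)$/s', $text->nodeValue, $m);

        $span = $doc->createElement('span');
        $span->appendChild($doc->createTextNode($m[2]));

        $parent = $text->parentNode;
        $parent->insertBefore($doc->createTextNode($m[1]), $text);
        $parent->insertBefore($span, $text);
        $parent->insertBefore($doc->createTextNode($m[3]), $text);
        $parent->removeChild($text);
    }

    // Serialize only the container's children, not the html/body wrapper.
    $root = $xpath->query('//div[@id="wrap-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// The question's sample, assuming the second <p> was meant to be </p>.
$html = <<<HTML
<div>
    <p>
        text2 text2 word text2
        <span>abcd</span>
        text2 text2 word text2
    </p>
    this is a long, very long statement with punctuations.
</div>
HTML;

$out = wrapTextNodes($html);
echo $out, "\n";
```

Unlike the regex, this does not care which characters the text contains, and the XPath filter skips text whose parent is a span regardless of that span's attributes.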
