This is normally where someone advises using an HTML parser. And indeed that would be more reliable for the task. Usually QueryPath or phpQuery are also easier on the eyes:
$p = phpQuery::newDocumentHTML($h);
$p->find("p")->not("span")->wrap("span");
But in this case I failed. It's a bit of a black art if you don't know all the magic jQuery selectors (and phpQuery doesn't have 'em all anyway). Your case is difficult since you want to work on individual text nodes. Hence you would actually have to use DOMDocument to walk the text nodes one by one. It's certainly doable, but too much API overhead for me. :}
So I skipped right to the regex approach, which, with a lot of caution, is in fact workable:
print preg_replace(
'#(?<!<span)>(\s*)(\w[\w,.\h]+)(\s*)<#',
'>$1<span>$2</span>$3<',
$html);
The actual trick is the lookbehind assertion (?<!<span) so it won't match text that is already wrapped in spans. It looks more confusing because I made it match whitespace \s and horizontal \h spaces separately, so the replacement can keep the surrounding whitespace outside the new span. You'll have to adapt [\w,.\h] to include all possible extra characters in the last line. This is where the regex approach shows its weakness - you cannot allow it to match < or >. And if your text strings are actually paragraphs, you'll have to undo the \s and \h separation.
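To see what the pattern does and does not catch, here is a small test run. The sample strings are made up for illustration and are not from your document; the pattern and replacement are the ones above.

```php
<?php
// Pattern and replacement copied from the answer above.
$pattern = '#(?<!<span)>(\s*)(\w[\w,.\h]+)(\s*)<#';
$replace = '>$1<span>$2</span>$3<';

// Hypothetical sample inputs, one per case.
$samples = [
    '<p>Hello world</p>',         // bare text: gets wrapped
    '<p><span>kept</span></p>',   // already inside a plain <span>: left alone
    "<p>Don't stop</p>",          // apostrophe is not in [\w,.\h]: no match
];

$results = [];
foreach ($samples as $html) {
    $results[] = preg_replace($pattern, $replace, $html);
}

print implode("\n", $results) . "\n";
```

The third sample is the weakness in action: one character outside the class and the whole text run is skipped. Note also that the lookbehind only recognizes a literal <span> directly before the text, so a span carrying attributes would probably get wrapped a second time.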
So again, works for your simple case. But DOM approaches are usually more reliable.
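For completeness, a rough sketch of what the DOMDocument route could look like. This is not something I ran against your document. It assumes you want every non-blank text node inside a p wrapped in a span unless it already sits inside one; wrapBareText is a made-up helper name, and the XPath expression encodes that assumption.

```php
<?php
// Hypothetical helper: wrap bare text nodes inside <p> elements in <span>.
function wrapBareText(string $html): string
{
    $doc = new DOMDocument();

    // Silence warnings about sloppy markup while parsing.
    $previous = libxml_use_internal_errors(true);
    // The wrapper <div> gives us a known container to serialize from.
    // Caveat: non-ASCII input would need an explicit encoding hint.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Non-blank text nodes below a <p> that have no <span> ancestor.
    $nodes = $xpath->query(
        '//p//text()[normalize-space() != "" and not(ancestor::span)]'
    );

    // Copy to an array first so the tree can be modified while looping.
    foreach (iterator_to_array($nodes) as $text) {
        $span = $doc->createElement('span');
        $text->parentNode->replaceChild($span, $text);
        $span->appendChild($text);
    }

    // Serialize only what was inside the wrapper <div>.
    $out = '';
    $root = $doc->getElementsByTagName('div')->item(0);
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

print wrapBareText('<p>Hello <span>world</span> again</p>') . "\n";
```

Unlike the regex, this does not care which characters the text contains or whether the existing spans carry attributes, which is the reliability argument in a nutshell. The price is the API overhead I was complaining about.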