I don't think if it can be done with a single Regular Expression and if it's possible, then honestly I'm so lazy for blowing my mind to make it.
In what way do you think?
I came to a solution that takes 4 steps to achieve what you desire:
- Extract all words from HTML and store them with their corresponding
positions.
- Explode each Negative List's value to words
- Match each word consecutively to the (1) in-order-words and store
- Replace recently found values with their new HTML wrapper (
<span class="negative">...</span>
) by their positions
I have, however, made a detailed flowchart (I'm not good at flowchats, sorry) that you feel better in understanding things. It could help if you look at codes at the first.

Here is what we have:
$HTML = <<< HTML
some data from <span class="positive">blahblah test</span> was not <span class="positive">statistically <span class="positive">valid</span></span>
HTML;
$listOfNegatives = ['not statistically valid'];
To extract words (real words) I used a RegEx which will fulfill our needs at this step:
~\b(?<![</])\w+\b(?![^<>]+>)~
To get positions of each word too, a flag should be used with preg_match_all()
: PREG_OFFSET_CAPTURE
/**
* Extract all words and their corresponsing positions
* @param [string] $HTML
* @return [array] $HTMLWords
*/
function extractWords($HTML) {
$HTMLWords = [];
preg_match_all("~\b(?<![</])\w+\b(?![^<>]+>)~", $HTML, $words, PREG_OFFSET_CAPTURE);
foreach ($words[0] as $word) {
$HTMLWords[$word[1]] = $word[0];
}
return $HTMLWords;
}
This function's output is something like this:
Array
(
[0] => some
[5] => data
[10] => from
[38] => blahblah
[47] => test
[59] => was
[63] => not
[90] => statistically
[127] => valid
)
What we should do here is to match each words of a list's value - consecutively - to words we just extracted. So as our first list's value not statistically valid
we have three words not
, statistically
and valid
and these words should come continuously in the extracted words array. (which happens)
To handle this I wrote a function:
/**
* Check if any of our defined list values can be found in an ordered-array of exctracted words
* @param [array] $HTMLWords
* @param [array] $listOfNegatives
* @return [array] $subString
*/
function checkNegativesExistence($HTMLWords, $listOfNegatives) {
$counter = 0;
$previousWordOffset = null;
$subStrings = [];
foreach ($listOfNegatives as $i => $string) {
$stringWords = explode(" ", $string);
$wordIndex = 0;
foreach ($HTMLWords as $offset => $HTMLWord) {
if ($wordIndex > count($stringWords) - 1) {
$wordIndex = 0;
$counter++;
}
if ($stringWords[$wordIndex] == $HTMLWord) {
$subStrings[$counter][] = [$HTMLWord, $offset, $previousWordOffset];
$wordIndex++;
} elseif (isset($subStrings[$counter]) && count($subStrings[$counter]) > 0) {
unset($subStrings[$counter]);
$wordIndex = 0;
}
$previousWordOffset = $offset + strlen($HTMLWord);
}
$counter++;
}
return $subStrings;
}
Which has an output like below:
Array
(
[0] => Array
(
[0] => Array
(
[0] => not
[1] => 63
[2] => 62
)
[1] => Array
(
[0] => statistically
[1] => 90
[2] => 66
)
[2] => Array
(
[0] => valid
[1] => 127
[2] => 103
)
)
)
If you see we have a complete string split into words and their offsets (we have two offsets, first one is real offset second one is offset of previous word). We need them later.
Now another thing we should consider is to replace this occurrence from offset 62
to 127 + strlen(valid)
with <span class="negative">not statistically valid</span>
and forget about every thing else.
/**
* Substitute newly matched strings with negative HTML wrapper
* @param [array] $subStrings
* @param [string] $HTML
* @return [string] $HTML
*/
function negativeHighlight($subStrings, $HTML) {
$offset = 0;
$HTMLLength = strlen($HTML);
foreach ($subStrings as $key => $value) {
$arrayOfWords = [];
foreach ($value as $word) {
$arrayOfWords[] = $word[0];
if (current($value) == $value[0]) {
$start = substr($HTML, $word[1], strlen($word[0])) == $word[0] ? $word[2] : $word[2] + $offset;
}
if (current($value) == end($value)) {
$defaultLength = $word[1] + strlen($word[0]) - $start;
$length = substr($HTML, $word[1], strlen($word[0])) === $word[0] ? $defaultLength : $defaultLength + $offset;
}
}
$string = implode(" ", $arrayOfWords);
$HTML = substr_replace($HTML, "<span class=\"negative\">{$string}</span>", $start, $length);
if ($HTMLLength > strlen($HTML)) {
$offset = -($HTMLLength - strlen($HTML));
} elseif ($HTMLLength < strlen($HTML)) {
$offset = strlen($HTML) - $HTMLLength;
}
}
return $HTML;
}
An important thing here I should note is that by doing first substitution we may affect offsets of other extracted values (that we don't have here). So calculating new HTML length is required:
if ($HTMLLength > strlen($HTML)) {
$offset = -($HTMLLength - strlen($HTML));
} elseif ($HTMLLength < strlen($HTML)) {
$offset = strlen($HTML) - $HTMLLength;
}
and... we should check if by this change of length how did our offsets changed:
- Word's offset is intact (some characters were added/removed after
this word)
- Word's offset is changed (some characters were added/removed before
this word)
This checking is done by this block (we need to check first and last word only):
if (current($value) == $value[0]) {
$start = substr($HTML, $word[1], strlen($word[0])) == $word[0] ? $word[2] : $word[2] + $offset;
}
if (current($value) == end($value)) {
$defaultLength = $word[1] + strlen($word[0]) - $start;
$length = substr($HTML, $word[1], strlen($word[0])) === $word[0] ? $defaultLength : $defaultLength + $offset;
}
Doing all together:
$newHTML = negativeHighlight(checkNegativesExistence(extractWords($HTML), $listOfNegatives), $HTML);
Output:
some data from <span class="positive">blahblah test</span> was <span class="negative">not statistically valid</span></span></span>
But there are problems with our last output: unmatched tags.
I'm sorry that I lied I've done this problem solving in 4 steps but it has one more. Here I made another RegEx to match all truly nested tags and those which are mistakenly existed:
~(<span[^>]+>([^<]*+<(?!/)(?:([a-zA-Z0-9]++)[^>]*>[^<]*</\3>|(?2)))*[^<]*</span>|(?'single'</[^>]+>|<[^>]+>))~
By a preg_replace_callback()
I only replace tags in group named single
with nothing:
echo preg_replace_callback("~(<span[^>]+>([^<]*+<(?!/)(?:([a-zA-Z0-9]++)[^>]*>[^<]*</\3>|(?2)))*[^<]*</span>|(?'single'</[^>]+>|<[^>]+>))~",
function ($match) {
if (isset($match['single'])) {
return null;
}
return $match[1];
},
$newHTML
);
and we have right output:
some data from <span class="positive">blahblah test</span> was <span class="negative">not statistically valid</span>
Failing cases
My solution does not output right HTML on below situations:
1- If a word like <was>
is between other words:
<span class="positive">blahblah test</span> <was> not
Why?
- Because my last RegEx spots
<was>
as an unmatched tag so it will
remove it.
2- If a word like not
(which is part of a negative list's value in
our list) is enclosed with <>
-> <not>
. Which outputs:
some data from <span class="positive">blahblah test</span> was <not> <span class="positive">statistically <span class="positive">valid</span></span>
Why?
- Because my first RegEx understands words which are not between tags
specific characters
<>
3- If list has values that one is the other's substring:
$listOfNegatives = ['not statistically valid', 'not statistically'];
Why?
Working demo