Complex RegEx replacement

Question

I'm trying to write a highlighting functionality. There are two types of highlighting: positive and negative. Positive is done first. Highlighting in itself is very simple - just wrapping keyword/phrase in a span with a specific class, which depends on type of highlighting.

Problem:
Sometimes, negative highlighting can contain positive one.

Example:
Original text:

some data from blahblah test was not statistically valid

After text passes through positive highlighting "filter", it'll end up like this:

some data from <span class="positive">blahblah test</span> was not <span class="positive">statistically valid</span>

or

some data from <span class="positive">blahblah test</span> was not <span class="positive">statistically <span class="positive">valid</span></span>

Then in negative list, we have a phrase not statistically valid.

In both cases, resulting text after passing through both "filters" should look like:

some data from <span class="positive">blahblah test</span> was <span class="negative">not statistically valid</span>

Conditions:
- Amount of span tags or their location within keyword/phrase from negative "filter" list is unknown
- Keyword/phrase must be matched even if it includes span tags (including right before and right after keyword/phrase). These span tags have to be removed.
- If any span tags are detected, amount of opening and closing span tags removed has to be equal.

Questions:
- How to detect these span tags if there are any?
- Is this even possible with RegEx alone?

What flavor of regex are you using? It's very likely an html parser should be used instead for adding the negative highlighting. — 4castle, Jul 04 '16 at 23:26
It's generally bad to use regex to parse things like HTML: see [here](http://stackoverflow.com/questions/8577060/why-is-it-such-a-bad-idea-to-parse-xml-with-regex), [here](http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg), and [here](http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la). — Pierce Darragh, Jul 05 '16 at 05:12
@pushpraj it replaces span in whole string, while I only need it to be replaced within keyword/phrase. Besides, you used wrong string :P — Igor Yavych, Jul 05 '16 at 08:38
This might just be possible: use `split` to split the string up, effectively extracting the negative expressions. Use the `PREG_SPLIT_DELIM_CAPTURE` flag to capture the expressions. Then apply positive highlighting to the remainder, and finally join using the captured expressions, with negative highlighting applied. — Chris Lear, Jul 05 '16 at 08:54
@ChrisLear can you provide some example please? It's bit confusing. — Igor Yavych, Jul 05 '16 at 09:33

revo · Accepted Answer · 2016-07-05T21:45:45.857

I don't think if it can be done with a single Regular Expression and if it's possible, then honestly I'm so lazy for blowing my mind to make it.

In what way do you think?

I came to a solution that takes 4 steps to achieve what you desire:

Extract all words from HTML and store them with their corresponding positions.
Explode each Negative List's value to words
Match each word consecutively to the (1) in-order-words and store
Replace recently found values with their new HTML wrapper (<span class="negative">...</span>) by their positions

I have, however, made a detailed flowchart (I'm not good at flowchats, sorry) that you feel better in understanding things. It could help if you look at codes at the first.

Here is what we have:

$HTML = <<< HTML
some data from <span class="positive">blahblah test</span> was not <span class="positive">statistically <span class="positive">valid</span></span>
HTML;

$listOfNegatives = ['not statistically valid'];

To extract words (real words) I used a RegEx which will fulfill our needs at this step:

~\b(?<![</])\w+\b(?![^<>]+>)~

To get positions of each word too, a flag should be used with preg_match_all(): PREG_OFFSET_CAPTURE

/**
 * Extract all words and their corresponsing positions
 * @param  [string] $HTML
 * @return [array]  $HTMLWords
 */
function extractWords($HTML) {
    $HTMLWords = [];
    preg_match_all("~\b(?<![</])\w+\b(?![^<>]+>)~", $HTML, $words, PREG_OFFSET_CAPTURE);
    foreach ($words[0] as $word) {
        $HTMLWords[$word[1]] = $word[0];
    }
    return $HTMLWords;
}

This function's output is something like this:

Array
(
    [0] => some
    [5] => data
    [10] => from
    [38] => blahblah
    [47] => test
    [59] => was
    [63] => not
    [90] => statistically
    [127] => valid
)

What we should do here is to match each words of a list's value - consecutively - to words we just extracted. So as our first list's value not statistically valid we have three words not, statistically and valid and these words should come continuously in the extracted words array. (which happens)

To handle this I wrote a function:

/**
 * Check if any of our defined list values can be found in an ordered-array of exctracted words
 * @param  [array] $HTMLWords
 * @param  [array] $listOfNegatives
 * @return [array] $subString
 */
function checkNegativesExistence($HTMLWords, $listOfNegatives) {
    $counter = 0;
    $previousWordOffset = null;
    $subStrings = [];

    foreach ($listOfNegatives as $i => $string) {
        $stringWords = explode(" ", $string);
        $wordIndex = 0;
        foreach ($HTMLWords as $offset => $HTMLWord) {
            if ($wordIndex > count($stringWords) - 1) {
                $wordIndex = 0;
                $counter++;
            }

            if ($stringWords[$wordIndex] == $HTMLWord) {
                $subStrings[$counter][] = [$HTMLWord, $offset, $previousWordOffset];
                $wordIndex++;
            } elseif (isset($subStrings[$counter]) && count($subStrings[$counter]) > 0) {
                unset($subStrings[$counter]);
                $wordIndex = 0;
            }
            $previousWordOffset = $offset + strlen($HTMLWord);
        }
        $counter++;
    }

    return $subStrings;
}

Which has an output like below:

Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => not
                    [1] => 63
                    [2] => 62
                )

            [1] => Array
                (
                    [0] => statistically
                    [1] => 90
                    [2] => 66
                )

            [2] => Array
                (
                    [0] => valid
                    [1] => 127
                    [2] => 103
                )
        )
)

If you see we have a complete string split into words and their offsets (we have two offsets, first one is real offset second one is offset of previous word). We need them later.

Now another thing we should consider is to replace this occurrence from offset 62 to 127 + strlen(valid) with <span class="negative">not statistically valid</span> and forget about every thing else.

/**
 * Substitute newly matched strings with negative HTML wrapper
 * @param  [array]  $subStrings
 * @param  [string] $HTML
 * @return [string] $HTML
 */
function negativeHighlight($subStrings, $HTML) {
    $offset = 0;
    $HTMLLength = strlen($HTML);

    foreach ($subStrings as $key => $value) {
        $arrayOfWords = [];
        foreach ($value as $word) {
            $arrayOfWords[] = $word[0];
            if (current($value) == $value[0]) {
                $start = substr($HTML, $word[1], strlen($word[0])) == $word[0] ? $word[2] : $word[2] + $offset;
            }
            if (current($value) == end($value)) {
                $defaultLength = $word[1] + strlen($word[0]) - $start;
                $length = substr($HTML, $word[1], strlen($word[0])) === $word[0] ? $defaultLength : $defaultLength + $offset;
            }
        }
        $string = implode(" ", $arrayOfWords);
        $HTML = substr_replace($HTML, "<span class=\"negative\">{$string}</span>", $start, $length);

        if ($HTMLLength > strlen($HTML)) {
            $offset = -($HTMLLength - strlen($HTML));
        } elseif ($HTMLLength < strlen($HTML)) {
            $offset = strlen($HTML) - $HTMLLength;
        }
    }

    return $HTML;
}

An important thing here I should note is that by doing first substitution we may affect offsets of other extracted values (that we don't have here). So calculating new HTML length is required:

if ($HTMLLength > strlen($HTML)) {
    $offset = -($HTMLLength - strlen($HTML));
} elseif ($HTMLLength < strlen($HTML)) {
    $offset = strlen($HTML) - $HTMLLength;
}

and... we should check if by this change of length how did our offsets changed:

Word's offset is intact (some characters were added/removed after this word)
Word's offset is changed (some characters were added/removed before this word)

This checking is done by this block (we need to check first and last word only):

if (current($value) == $value[0]) {
    $start = substr($HTML, $word[1], strlen($word[0])) == $word[0] ? $word[2] : $word[2] + $offset;
}
if (current($value) == end($value)) {
    $defaultLength = $word[1] + strlen($word[0]) - $start;
    $length = substr($HTML, $word[1], strlen($word[0])) === $word[0] ? $defaultLength : $defaultLength + $offset;
}

Doing all together:

$newHTML = negativeHighlight(checkNegativesExistence(extractWords($HTML), $listOfNegatives), $HTML);

Output:

some data from <span class="positive">blahblah test</span> was <span class="negative">not statistically valid</span></span></span>

But there are problems with our last output: unmatched tags.

I'm sorry that I lied I've done this problem solving in 4 steps but it has one more. Here I made another RegEx to match all truly nested tags and those which are mistakenly existed:

~(<span[^>]+>([^<]*+<(?!/)(?:([a-zA-Z0-9]++)[^>]*>[^<]*</\3>|(?2)))*[^<]*</span>|(?'single'</[^>]+>|<[^>]+>))~

By a preg_replace_callback() I only replace tags in group named single with nothing:

echo preg_replace_callback("~(<span[^>]+>([^<]*+<(?!/)(?:([a-zA-Z0-9]++)[^>]*>[^<]*</\3>|(?2)))*[^<]*</span>|(?'single'</[^>]+>|<[^>]+>))~",
    function ($match) {
        if (isset($match['single'])) {
            return null;
        }
        return $match[1];
    },
    $newHTML
);

and we have right output:

some data from <span class="positive">blahblah test</span> was <span class="negative">not statistically valid</span>

Failing cases

My solution does not output right HTML on below situations:

1- If a word like <was> is between other words:

<span class="positive">blahblah test</span> <was> not

Why?

Because my last RegEx spots <was> as an unmatched tag so it will
remove it.

2- If a word like not (which is part of a negative list's value in our list) is enclosed with <> -> <not>. Which outputs:

some data from <span class="positive">blahblah test</span> was <not> <span class="positive">statistically <span class="positive">valid</span></span>

Why?

Because my first RegEx understands words which are not between tags specific characters <>

3- If list has values that one is the other's substring:

$listOfNegatives = ['not statistically valid', 'not statistically'];

Why?

Because they overlap.

Working demo

Oh man, to say this is an amazing answer would simply be an understatement. This has to be most awesome answer I've seen on SO so far. I'm endlessly thankful for the code and detailed explanation. Just gotta find a way around 3rd failing case (1 and 2 shouldn't happen because the only tags in the input are spans from positive filtering and p tags). Once again, thank you for taking time to write up this answer! I wish I could upvote it at least 20 times :) — Igor Yavych, Jul 05 '16 at 20:22
You're welcome. I just updated `checkNegativesExistence()` and `negativeHighlight()` functions which I found had some issues in replacements and a full working demo is added at the end. @IgorYavych — revo, Jul 05 '16 at 21:24
Do you think it's possible to solve 3rd failing case without heavy modifications? — Igor Yavych, Jul 09 '16 at 07:50
Although you shouldn't produce such a case, I think one way could be 1) specifying sub-strings 2) storing positions of previous substitutions 3) check current position on substituting an specified sub-string, so that if the position got a replacement before, skip it. Done. @IgorYavych — revo, Jul 09 '16 at 19:19

score 1 · Answer 2 · answered Jul 05 '16 at 10:04

Here is what I've come up with. I honestly can't say whether it will cope with the full range of the requirement, but it might help a bit

$s = 'some data from blahblah test was not statistically valid';

$replaced = highlight($s);

var_dump($replaced);

function highlight($s) {
    // split the string on the negative parts, capturing the full negative string each time
    $parts = preg_split('/(not statistically valid)/',$s,-1,PREG_SPLIT_DELIM_CAPTURE);
    $output = '';
    $negativePart = 0; // keep track of whether we're dealing with a negative or part or the remainder - they will alternate.

    foreach ($parts as $part) {
        if ($negativePart) {
            $output .= negativeHighlight($part);
        } else {
            $output .= positiveHighlight($part);
        }
        $negativePart = !$negativePart;
    }
    return $output;
}

// only deals with a single negative part at a time, so just wraps with a span
function negativeHighlight($part) {
    return "<span class='negative'>$part</span>";
}

// potentially deals with several replacements at once
function positiveHighlight($part) {
    return preg_replace('/(blahblah test)|(statistically valid)/', "<span class='positive'>$1</span>", $part);
}

Complex RegEx replacement

2 Answers2

In what way do you think?

Failing cases