How to add restriction in regex

Question

I have a regex function that replaces the Xth occurrence of a word in a text. I am trying to add a condition: do not replace the word if it is inside an <h1>, <h2>, or <h3> tag, or inside an image's alt attribute. Could someone help me edit the function to add this condition, please?

public function str_ireplace_n($search, $replace, $subject, $occurrence)
{
    $search = preg_quote($search);
    return preg_replace("/^((?:(?:.*?$search){" . --$occurrence . "}.*?))$search/i", "$1$replace", $subject);
}

Example:

$text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. <h1>Lorem ipsum dolor sit</h1> Proin libero erat, malesuada eget volutpat vitae, efficitur vitae ipsum. Vivamus et <h2>Lorem ipsum dolor sit</h2> justo non quam laoreet euismod. Ut eget dapibus ligula. <img src="url" alt="Lorem ipsum dolor sit"/> Vestibulum vestibulum.';

// I replace the second "Lorem" in this text with a link
$text = $this->str_ireplace_n('Lorem', ' <a href="' . $domain . '" alt="">Lorem</a> ', $text, 2); //2 for the second occurrence

// The result adds a link to the "Lorem" inside the <h1>, and I want to avoid this.
// I want the regex to do nothing when the keyword is in an h1, an h2, or the alt of an image.

I don't choose which "Lorem" gets replaced; the occurrence is random. I have to make sure nothing is replaced when that occurrence is inside an <h1>/<h2> or an image alt.

Thanks in advance.

Can you give an example? (Please edit your post instead of replying here.) — Yassine CHABLI, Apr 16 '19 at 17:16
I added an example but someone asked me something in comment I just answered and he removed the comment. — Pixel, Apr 16 '19 at 17:27
It's not trivial, consider replacing `sit`. I doubt regex alone can do it. — ArtisticPhoenix, Apr 16 '19 at 17:32
Possible duplicate of [Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms](https://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la) — miken32, Apr 16 '19 at 19:04

ArtisticPhoenix · Accepted Answer · 2019-04-16T18:32:22.963

Personally I would use something like preg_split first:

$string = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. <h1>Lorem ipsum dolor sit</h1> Proin libero erat, malesuada eget volutpat vitae, efficitur vitae ipsum. Vivamus et <h2>Lorem ipsum dolor sit</h2> justo non quam laoreet euismod. Ut eget dapibus ligula. <img src="url" alt="Lorem ipsum dolor sit"/> Vestibulum vestibulum.';

$split = preg_split('/(<[^\/]+(?:\/|<\/[^>]+)>)/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);

Which gives you this (this is the basic thing we need to do):

Array
(
    [0] => Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
    [1] => <h1>Lorem ipsum dolor sit</h1>
    [2] =>  Proin libero erat, malesuada eget volutpat vitae, efficitur vitae ipsum. Vivamus et 
    [3] => <h2>Lorem ipsum dolor sit</h2>
    [4] =>  justo non quam laoreet euismod. Ut eget dapibus ligula. 
    [5] => <img src="url" alt="Lorem ipsum dolor sit"/>
    [6] =>  Vestibulum vestibulum.
)

Now the tagged items are separated from the rest of the text. We can loop over this array and check whether the first character of each element is <, which tells us whether that element is a tag (with its content) or plain text outside it. This should work as long as your tags end in </...> or />.

Basically the HTML tags + content become the delimiter, which we also capture.
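
As a rough illustration of that loop, the sketch below splits a shortened sample string (an assumption made up for this example, not the question's full text) and labels each piece as 'tag' or 'text' based on its first character. The $labels array is a hypothetical helper used only here.

```php
<?php
// Shortened sample input (an assumption for illustration only)
$string = 'Lorem ipsum. <h1>Lorem title</h1> More Lorem text. <img src="url" alt="Lorem"/> End.';

// Same pattern as above; -1 means "no limit" on the number of pieces
$split = preg_split('/(<[^\/]+(?:\/|<\/[^>]+)>)/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);

// Hypothetical helper array: one label per piece
$labels = [];
foreach ($split as $segment) {
    // A captured delimiter starts with "<"; everything else is body text
    $labels[] = (0 === strpos($segment, '<')) ? 'tag' : 'text';
}

print_r($split);
print_r($labels);
```

Only the pieces labelled 'text' are candidates for replacement; the 'tag' pieces are put back untouched when the array is joined again.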

The point is that HTML is not a regular language, so a simple regex cannot parse it on its own. We have to do some of the work in PHP to tie it all together, but a simple regex can break the problem down, as I have done here:

$subject = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. <h1>Lorem ipsum dolor sit</h1> Proin libero erat, malesuada eget volutpat vitae, efficitur vitae ipsum. Vivamus et <h2>Lorem ipsum dolor sit</h2> Lorem justo non quam laoreet euismod. Ut eget dapibus ligula. <img src="url" alt="Lorem ipsum dolor sit"/> Vestibulum vestibulum.';

//word to replace
$search = 'Lorem';
//stuff to replace with
$replace = '<a href="Lorem">foo</a>';
 //what match to replace
$occurrence = 2;

function str_ireplace_n($search, $replace, $subject, $occurrence){
    //escape the search term, including the "/" delimiter used below
    $search = preg_quote($search, '/');

    //separate the HTML from the "body" text (-1 means no limit)
    $split = preg_split('/(<(?:h1|h2|h3|img)[^\/]+(?:\/|<\/[^>]+)>)/', $subject, -1, PREG_SPLIT_DELIM_CAPTURE);
    //the number of current matches
    $match = 0;

    foreach($split as &$s){
        //if strpos < is 0 it's the first character - meaning it's part of HTML (we don't want that)
        //otherwise check whether this piece of text contains the search word
        if(0 !== strpos($s,'<') && preg_match('/\b'.$search.'\b/i', $s)){
            //increment the match counter
            ++$match;
            //replace the match if it's the nth one
            if($match == $occurrence)  $s = preg_replace('/\b'.$search.'\b/i',$replace,$s);
        }
    }
    //break the reference left over from the loop
    unset($s);

    return implode($split);
}

echo str_ireplace_n($search, $replace, $subject, $occurrence);

Output:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. <h1>Lorem ipsum dolor sit</h1> 
 Proin libero erat, malesuada eget volutpat vitae, efficitur vitae ipsum. Vivamus et 
  <h2>Lorem ipsum dolor sit</h2> <a href="Lorem">foo</a> justo non quam laoreet euismod. 
  Ut eget dapibus ligula. <img src="url" alt="Lorem ipsum dolor sit"/> Vestibulum vestibulum.

The replaced part is <a href="Lorem">foo</a>.

I added a few line breaks to the output for readability, and another "Lorem" to the input, since there was no second one outside the HTML tags to match on. As you can see, nothing within the HTML tags was modified, and only the second match was changed.
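
One caveat: the function above counts pieces of text that contain the word, not individual occurrences, and it replaces every match inside the chosen piece. If each occurrence has to be counted on its own, a variant along the following lines might work. It is only a sketch: the name str_ireplace_nth_outside_tags and the sample text are hypothetical, and it swaps in preg_replace_callback with a shared counter in place of the preg_match/preg_replace pair.

```php
<?php
// Hypothetical variant: counts every occurrence outside h1/h2/h3/img separately
function str_ireplace_nth_outside_tags($search, $replace, $subject, $occurrence)
{
    $occurrence = (int) $occurrence;
    $pattern = '/\b' . preg_quote($search, '/') . '\b/i';
    $split = preg_split('/(<(?:h1|h2|h3|img)[^\/]+(?:\/|<\/[^>]+)>)/', $subject, -1, PREG_SPLIT_DELIM_CAPTURE);
    $seen = 0; // occurrences seen so far, shared across all text pieces

    foreach ($split as $i => $segment) {
        // With a single capture group, odd indexes hold the captured tags: skip them
        if ($i % 2 === 1) {
            continue;
        }
        // Visit each match in turn and swap in the replacement only for the nth one
        $split[$i] = preg_replace_callback($pattern, function ($m) use (&$seen, $occurrence, $replace) {
            return (++$seen === $occurrence) ? $replace : $m[0];
        }, $segment);
    }

    return implode('', $split);
}

// Made-up sample text with two occurrences before the tag and one after it
$text = 'Lorem and Lorem <h1>Lorem</h1> then Lorem';
echo str_ireplace_nth_outside_tags('Lorem', '[X]', $text, 2), "\n";
// Lorem and [X] <h1>Lorem</h1> then Lorem
echo str_ireplace_nth_outside_tags('Lorem', '[X]', $text, 3), "\n";
// Lorem and Lorem <h1>Lorem</h1> then [X]
```

Because the callback returns the replacement as a literal string, sequences such as $1 in the replacement are not interpreted as backreferences here.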

It's not 100% clear exactly what you need (as is often the case with these types of questions), so I have tried to explain how to do it rather than just doing it.
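
One limit of the split pattern worth knowing: [^\/]+ stops at the first slash, so a tag whose attributes contain a / (for example a src path), or an img written without the closing />, is not captured and its alt text ends up treated as body text. The sketch below compares the pattern from this answer with a hypothetical, more tolerant one. Both the $tolerant pattern and the sample input are assumptions for illustration, and the tolerant pattern would still be fooled by a > inside an attribute value or by nested headings.

```php
<?php
// Sample input (an assumption): the img src contains slashes and the tag is not self-closed
$html = 'Lorem <h2>Lorem sit</h2> ipsum <img src="/img/a.png" alt="Lorem"> Lorem';

// Pattern from the answer: the img is missed because of the "/" inside src
$original = '/(<(?:h1|h2|h3|img)[^\/]+(?:\/|<\/[^>]+)>)/';

// Hypothetical alternative: each heading is matched up to its own closing tag,
// and an img is matched up to the next ">"
$tolerant = '/(<h1\b[^>]*>.*?<\/h1>|<h2\b[^>]*>.*?<\/h2>|<h3\b[^>]*>.*?<\/h3>|<img\b[^>]*>)/is';

$a = preg_split($original, $html, -1, PREG_SPLIT_DELIM_CAPTURE);
$b = preg_split($tolerant, $html, -1, PREG_SPLIT_DELIM_CAPTURE);

print_r($a); // 3 pieces: the img stays inside the last text piece
print_r($b); // 5 pieces: the img is captured as its own delimiter
```

For HTML that is less predictable than this, a real HTML parser is likely the safer route.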


Thank you for your answer, but I can't do it this way. I don't choose which "Lorem" gets replaced; the occurrence is random. I have to make sure I don't do anything when the occurrence is on a <h1>/<h2> or an image alt. — Pixel, Apr 16 '19 at 18:00
Obviously you can change "Lorem" to whatever you want, and this `(<[^\/]+(?:\/|<\/[^>]+)>)` can be `(<(?:h1|h2|h3|img)[^\/]+(?:\/|<\/[^>]+)>)` — ArtisticPhoenix, Apr 16 '19 at 18:03
@Pixel - `I have to make sure I don't do anything when the occurrence is on a <h1>/<h2> or an image alt` Which is exactly why you need to tokenize them first, so you can avoid anything that is part of an HTML tag (begins with `<`). — ArtisticPhoenix, Apr 16 '19 at 18:15
