There's unfortunately no magic character class or trick in PHP regex (that I know of) that solves this out of the box, so I've opted for another route:
$search = '+ fête foret ca rentrée w0w !!!';
$text = 'La paix fêtée avec plus de 40 cultures dans une forêt. Ça commence bien devant la rentrée...<br> Il répond: w0w tros cool!!! En + il fait chaud!';
$left_token = '<b>';
$right_token = '</b>';
$encoding = 'UTF-8';
// Normalize both the search terms and the text
$search_normalized = normalize($search);
$text_normalized = normalize($text);
// Split the search string into words on any Unicode whitespace, skipping empty entries
$search_needles = preg_split('/\s+/u', $search_normalized, -1, PREG_SPLIT_NO_EMPTY);
// We'll save the output in a separate variable
$text_output = $text;
// Since the tokens are variables, compute how many characters each insertion adds
$offset_size = mb_strlen($left_token . $right_token, $encoding);
// Start searching
foreach ($search_needles as $needle) {
    // Reset for each word
    $search_offset = 0;
    $len = mb_strlen($needle, $encoding);
    // We may have several occurrences
    while (true) {
        if ($search_offset > mb_strlen($text_normalized, $encoding)) { // Past the end of the text
            break;
        }
        $pos = mb_stripos($text_normalized, $needle, $search_offset, $encoding);
        if ($pos === false) { // No more occurrences of this needle
            break;
        }
        // Insert tokens
        $text_output = mb_substr($text_output, 0, $pos, $encoding) . // Left side
            $left_token .
            mb_substr($text_output, $pos, $len, $encoding) . // The enclosed word
            $right_token .
            mb_substr($text_output, $pos + $len, null, $encoding); // Right side
        // Keep the normalized text in sync, otherwise the positions would drift apart
        $text_normalized = mb_substr($text_normalized, 0, $pos, $encoding) . // Left side
            $left_token .
            mb_substr($text_normalized, $pos, $len, $encoding) . // The enclosed word
            $right_token .
            mb_substr($text_normalized, $pos + $len, null, $encoding); // Right side
        // Advance past the inserted tokens
        $search_offset = $pos + $len + $offset_size;
    }
}
echo($text_output);
var_dump($text_output);
// Credits: http://stackoverflow.com/a/10064701
function normalize($input) {
    // Every replacement must be exactly one character long so that offsets in the
    // normalized text line up with the original; 'Ð' and 'ß' therefore map to single letters.
    $normalizeChars = array(
        'Š'=>'S', 'š'=>'s', 'Ð'=>'D', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A',
        'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E', 'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I',
        'Ï'=>'I', 'Ñ'=>'N', 'Ń'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U',
        'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'s', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a',
        'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i',
        'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ń'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u',
        'ú'=>'u', 'û'=>'u', 'ü'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y', 'ƒ'=>'f',
        'ă'=>'a', 'ș'=>'s', 'ț'=>'t', 'Ă'=>'A', 'Ș'=>'S', 'Ț'=>'T',
    );
    return strtr($input, $normalizeChars);
}
Basically:
- Normalize: Convert needle and haystack to normal ASCII characters.
- Find position: Search for the position of the normalized needle in the normalized haystack.
- Insert: Insert the opening and closing tags at the same position in the original string. This relies on the normalized string having the same character offsets as the original.
- Repeat: Sometimes you may have several occurrences. This process is repeated until no occurrence is left.
Sample output:
La paix <b>fêté</b>e avec plus de 40 cultures dans une <b>forêt</b>. <b>Ça</b> commence bien devant la <b>rentrée</b>...<br> Il répond: <b>w0w</b> tros cool<b>!!!</b> En <b>+</b> il fait chaud!
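One limitation of the loop above is that the tokens are inserted while the search is still running, so a later search word such as `b` could match inside an already inserted `<b>` tag. A possible variation (a minimal sketch, not part of the original answer) first collects every match range on the normalized text, merges overlaps, and only then inserts the tokens from right to left so that earlier offsets never shift. The function names `fold_accents()` and `highlight_terms()` are made up for this illustration, and the accent map is deliberately shortened; it assumes each replacement is exactly one character.

```php
<?php
// Hypothetical helper: folds accents with a strictly one-to-one character map,
// so character offsets in the folded string match the original string.
function fold_accents(string $input): string
{
    // Shortened map for illustration only; extend it as needed.
    $map = array(
        'à'=>'a', 'â'=>'a', 'ç'=>'c', 'é'=>'e', 'è'=>'e', 'ê'=>'e', 'ë'=>'e',
        'î'=>'i', 'ï'=>'i', 'ô'=>'o', 'ù'=>'u', 'û'=>'u', 'ü'=>'u',
        'À'=>'A', 'Â'=>'A', 'Ç'=>'C', 'É'=>'E', 'È'=>'E', 'Ê'=>'E',
    );
    return strtr($input, $map);
}

// Hypothetical function: wraps every accent-insensitive match in $left/$right.
function highlight_terms(string $text, string $search, string $left = '<b>', string $right = '</b>'): string
{
    $encoding = 'UTF-8';
    $haystack = fold_accents($text);
    $needles = preg_split('/\s+/u', fold_accents($search), -1, PREG_SPLIT_NO_EMPTY);

    // 1. Collect [start, end) character ranges of every occurrence.
    $ranges = array();
    foreach ($needles as $needle) {
        $len = mb_strlen($needle, $encoding);
        $offset = 0;
        while (($pos = mb_stripos($haystack, $needle, $offset, $encoding)) !== false) {
            $ranges[] = array($pos, $pos + $len);
            $offset = $pos + $len;
        }
    }

    // 2. Sort by start position and merge overlapping ranges.
    usort($ranges, function ($a, $b) {
        return $a[0] <=> $b[0];
    });
    $merged = array();
    foreach ($ranges as $range) {
        $last = count($merged) - 1;
        if ($last >= 0 && $range[0] < $merged[$last][1]) {
            $merged[$last][1] = max($merged[$last][1], $range[1]);
        } else {
            $merged[] = $range;
        }
    }

    // 3. Insert the tokens from right to left so earlier offsets stay valid.
    for ($i = count($merged) - 1; $i >= 0; $i--) {
        list($start, $end) = $merged[$i];
        $text = mb_substr($text, 0, $start, $encoding)
            . $left . mb_substr($text, $start, $end - $start, $encoding) . $right
            . mb_substr($text, $end, null, $encoding);
    }
    return $text;
}

echo highlight_terms('Être ou ne pas être', 'etre'), "\n";
// prints: <b>Être</b> ou ne pas <b>être</b>
```

Because the haystack is never modified during the search, there is no second normalized copy to keep in sync and no token length to account for when advancing the offset.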