I know it's an old question, but after searching a lot, I've found no correct answer. So here is how I do it in Portuguese; I think it will work for other languages too.
For English-speaking readers, here is a description of the problem.
Let's say we have a sentence like "... sobre formação, os cursos são rápidos."
We have "ç" and "ã" in "formação" and a "á" in "rápidos".
If we use SQL MATCH, this sentence is returned as a valid result when we search for "formação" but also for "Formação", meaning that MATCH is case-insensitive.
To highlight the words, we can then use a regexp like this one:
    $sentence = preg_replace("/($str_regexp)/i", "<span style=\"font-weight:bold; color:#005200;\">$0</span>", $sentence);
where $str_regexp is a string with all the words we were looking for, with | as separator, e.g. "formação|rápidos".
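A minimal sketch of how $str_regexp could be built from an array of words. The $words array and the <b> tag are assumptions made for illustration. preg_quote is added so that a word containing regexp characters cannot break the pattern, and the u flag asks PCRE to treat pattern and subject as UTF-8.

```php
<?php
// Hypothetical list of searched words (illustration only)
$words = ["formação", "rápidos"];

// Escape each word for use between "/" delimiters, then join with "|"
$quoted = array_map(function ($w) {
    return preg_quote($w, '/');
}, $words);
$str_regexp = implode('|', $quoted);

$sentence = "... sobre formação, os cursos são rápidos.";

// i = case-insensitive, u = UTF-8 mode
$out = preg_replace("/($str_regexp)/iu", '<b>$0</b>', $sentence);

echo $out, "\n";
```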
But if we perform a SQL MATCH, we can see that this sentence also matches "Formacao" or "rapidos". For the MATCH query, the missing "ã" is not a problem.
But when we want to highlight, the regexp doesn't work. It is case-insensitive, but for it "formacao" is not the same as "formação", whereas for SQL they are the same...
I suppose this comes from the fact that the fulltext index is probably a modified copy of the original text, without the short words and without accents. The fact that the index doesn't contain the short words (2 or 3 letters) explains (maybe) why SQL is able to tell us "This sentence matches the word you're searching for" but is unable to tell us WHERE the words are.
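The mismatch on the regexp side can be reproduced in a few lines. This is only a small demonstration using the example sentence from above; the variable names are made up for the illustration.

```php
<?php
$sentence = "... sobre formação, os cursos são rápidos.";

// Only the case differs: the i flag is enough to get a match
$case_only = preg_match('/Formação/iu', $sentence);

// The accents differ: for PCRE, "a" and "ã" are different characters
$no_accent = preg_match('/formacao/iu', $sentence);

var_dump($case_only); // int(1)
var_dump($no_accent); // int(0)
```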
To highlight "formação" in the original text when the user searches for "formacao", I do this:
    function highlight($tab_mot, $text, $start, $end)
    {
        if (empty($tab_mot)) {
            return $text; // Nothing to highlight
        }
        // Strip the accents from each searched word, escape it for use in
        // a regexp, then join the words with "|"
        $tab_regexp = array();
        foreach ($tab_mot as $mot) {
            $tab_regexp[] = preg_quote(iconv("UTF-8", "US-ASCII//TRANSLIT", $mot), "/");
        }
        $str_regexp = implode("|", $tab_regexp);
        // Make a copy of the original text, but without accents
        $text_tmp = iconv("UTF-8", "US-ASCII//TRANSLIT", $text);
        // Look for all occurrences of the words, case-insensitively.
        // With PREG_OFFSET_CAPTURE, each entry of $matches[0] is an array
        // (matched string, offset in $text_tmp)
        preg_match_all("/$str_regexp/i", $text_tmp, $matches, PREG_OFFSET_CAPTURE);
        // Walk the matches from last to first. Otherwise, inserting tags
        // around the first word would make all following offsets wrong
        for ($x = count($matches[0]) - 1; $x >= 0; $x--) {
            $offset_debut = $matches[0][$x][1]; // Offset to start of word
            // Offset to end is offset from start + length
            $offset_fin = $offset_debut + strlen($matches[0][$x][0]);
            // $text_tmp is pure ASCII, so these byte offsets are also
            // character offsets in $text, as long as iconv replaced each
            // accented character with exactly one ASCII character.
            // Add the closing tag first and THEN the opening tag, to
            // preserve the offset values
            $text = mb_substr($text, 0, $offset_fin, 'UTF-8').$end.mb_substr($text, $offset_fin, null, 'UTF-8');
            $text = mb_substr($text, 0, $offset_debut, 'UTF-8').$start.mb_substr($text, $offset_debut, null, 'UTF-8');
        }
        return $text; // Return text with highlights
    }
Parameters:
- tab_mot is an array of the words I used for the query
- text is the sentence matching the query
- start is the tag I want to insert before the word to highlight
- end is the tag to insert after the word
So tab_mot can contain "formacao" while text contains "formação".
I think there are enough comments to understand it. Notice the use of mb_substr rather than substr (mb_substr_replace doesn't exist).
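To see why mb_substr matters here, a short sketch comparing the two functions on a UTF-8 string. The sample string and the offset are chosen for the illustration; the offset is a character offset, like the ones found in the accent-free copy of the text.

```php
<?php
$text = "formação rápida";

// "rápida" starts at character 9 (f-o-r-m-a-ç-ã-o-space come before it)
$char_offset = 9;

// mb_substr counts characters, so it lands on the "r"
$ok = mb_substr($text, $char_offset, null, 'UTF-8');

// substr counts bytes: "ç" and "ã" take 2 bytes each in UTF-8,
// so the same number lands 2 characters too early
$bad = substr($text, $char_offset);

var_dump($ok);  // string "rápida"
var_dump($bad); // string "o rápida"
```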
Note: just a detail. For iconv to work correctly, don't forget to set the locale using
    $ret = setlocale(LC_ALL, "pt_BR.utf-8"); // Brazilian Portuguese in my case
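The offsets found in the accent-free copy only line up with the original text if iconv replaced every accented character with exactly one ASCII character, and that depends on the locale and on the iconv implementation. A cheap sanity check could look like the sketch below. The helper name translit_keeps_offsets is hypothetical, and the "formac~ao" string is a made-up example of an expanded transliteration, not the output of any particular system. Equal lengths do not prove that every offset is aligned, but different lengths prove that some are not.

```php
<?php
// Hypothetical helper: true when $ascii has exactly one byte for each
// character of $utf8, which the offset-based highlighting relies on
function translit_keeps_offsets(string $utf8, string $ascii): bool
{
    return mb_strlen($utf8, 'UTF-8') === strlen($ascii);
}

// One ASCII character per accented character: offsets can be reused
var_dump(translit_keeps_offsets("formação", "formacao"));  // bool(true)

// Made-up expanded transliteration: offsets would be shifted
var_dump(translit_keeps_offsets("formação", "formac~ao")); // bool(false)
```

In highlight(), such a check could be run on $text and $text_tmp before using the offsets, returning the text unchanged when it fails.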