Highlight search term in mysql php search accented characters also

Question

Following answer 3 to the question "Highlight search term in mysql php search", I got word highlighting working. The one thing I still cannot figure out is how to highlight accented versions of the word: the query does find, for instance, both "wesha" and "weshá", but highlighting only works for "wesha".

Here is my code:

echo "<p>".str_replace($palabra,"<strong>$palabra</strong>",$row['definicion'])."</p>";

Thanks

By the way, switching to str_ireplace also matches capitalized words, but it turns them into lowercase. Is there a way to keep the original capitalization as well?

You mean "weshá" is not being made bold when it is present in $row['definicion']? Am I getting that right? — Dimag Kharab, Feb 12 '14 at 04:53
For me it's working; I mean it is not replacing the word but making both words bold — Dimag Kharab, Feb 12 '14 at 05:03
If I search for either 'wesha' or 'weshá', the query returns both, but highlighting works only for the exact search term. I mean, if I search for 'weshá', the results show both, but only 'weshá' is highlighted — Andrés Chandía, Feb 12 '14 at 05:05
I have changed the collation in the db from utf8_spanish_ci to utf8_bin, and now the search only returns the exact match, so the highlighting agrees with it. It is not the best solution, but at least there is no confusion for the user. A downside, though, is that `str_ireplace` does not work anymore. — Andrés Chandía, Feb 12 '14 at 06:03

Peter · Answer 1 · 2022-01-08T14:23:10.337

I know it's an old question, but after a lot of searching I found no correct answer. So here is how I do it in Portuguese; I think it will work for other languages too. For English-speaking readers, here is a description of the problem.

Let's say we have a sentence like "... sobre formação, os cursos são rápidos." We have "ç" and "ã" in "formação" and a "á" in "rápidos".

If we use SQL MATCH, this sentence is returned as a correct result when we search for "formação" but also for "Formação", meaning the match is case-insensitive.

To highlight the words, we can then use a regexp like this one:

  $sentence =  preg_replace("/($str_regexp)/i","<span style=\"font-weight:bold; color:#005200;\">$0</span>",$sentence);

where $str_regexp is a string with all the words we were looking for, separated by |, e.g. "formação|rápidos".
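A minimal sketch of how such a $str_regexp could be built from an array of searched words. The helper name build_highlight_regexp and the variable $words are made up for this example. preg_quote keeps characters such as "." or "?" in a search word from being read as regexp syntax, and the "u" modifier (an addition to the pattern above) makes PCRE treat pattern and subject as UTF-8.

```php
<?php
// Hypothetical helper: joins the searched words into one alternation.
function build_highlight_regexp(array $words): string
{
    $quoted = array_map(function ($word) {
        return preg_quote($word, '/');   // escape regexp metacharacters
    }, $words);
    return implode('|', $quoted);
}

$words = ['formação', 'rápidos'];
$sentence = '... sobre Formação, os cursos são rápidos.';
$str_regexp = build_highlight_regexp($words);

// "i" = case-insensitive, "u" = pattern and subject are UTF-8.
// $0 re-inserts the text exactly as it appears in the sentence,
// so the original capitalization is kept.
$highlighted = preg_replace("/($str_regexp)/iu", '<strong>$0</strong>', $sentence);
echo $highlighted, "\n";
```

Because the replacement uses $0 rather than the search word, "Formação" stays capitalized in the output.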

But if we run the SQL MATCH, we can see that this sentence also matches "Formacao" or "rapidos". For the MATCH query, the missing "ã" is not a problem. But when we want to highlight, the regexp doesn't work. It is case-insensitive, but for the regexp "formacao" is not the same as "formação", whereas for SQL it is the same.
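The mismatch can be reproduced in a few lines. This is only an illustration of the regexp side; the variable names are chosen for the example.

```php
<?php
$sentence = '... sobre formação, os cursos são rápidos.';

// A difference in case is folded by the "i" modifier: this matches.
$case_match = preg_match('/Formação/iu', $sentence);

// A missing accent is not folded by the regexp engine: no match,
// even though an accent-insensitive SQL search returns the row.
$accent_match = preg_match('/formacao/iu', $sentence);

var_dump($case_match, $accent_match);
```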

I suppose this comes from the fact that the fulltext index is probably a modified copy of the original text, without the short words and without accents. The fact that the index doesn't contain the short words (2 or 3 letters) may explain why SQL can tell us "this sentence matches the word you're searching for" but cannot tell us WHERE the words are.

In order to highlight "formação" in the original text when the user searches for "formacao", I do this:

function highlight($tab_mot, $text, $start, $end)
{
    // Strip the accents from each searched word, escape regexp
    // metacharacters, then join the words with "|"
    $tab_regexp = array();
    foreach ($tab_mot as $mot) {
        $tab_regexp[] = preg_quote(iconv("UTF-8", "US-ASCII//TRANSLIT", $mot), "/");
    }
    $str_regexp = implode("|", $tab_regexp);

    // Make a copy of the original text, but without accents
    $text_tmp = iconv("UTF-8", "US-ASCII//TRANSLIT", $text);

    // Look for all occurrences of the words in case-insensitive mode.
    // With PREG_OFFSET_CAPTURE, $matches[0] is an array of
    // [matched string, offset] pairs.
    preg_match_all("/$str_regexp/i", $text_tmp, $matches, PREG_OFFSET_CAPTURE);

    // Uncomment to see what we get
    // echo "<pre>"; print_r($matches); echo "</pre>";

    $nb = count($matches[0]);   // Number of matches
    $tab_offset_debut = array();
    $tab_offset_fin = array();

    for ($x = 0; $x < $nb; $x++)
    {
        $offset_debut = $matches[0][$x][1]; // Offset to start of word
        $tab_offset_debut[$x] = $offset_debut;
        // Offset to end is offset from start + length
        $tab_offset_fin[$x] = $offset_debut + strlen($matches[0][$x][0]);
    }

    // Reverse the arrays. Otherwise, once the first word has been
    // changed, all following offsets would be wrong
    rsort($tab_offset_debut, SORT_NUMERIC);
    rsort($tab_offset_fin, SORT_NUMERIC);

    // Loop over all offsets (from the last to the first).
    // The offsets come from the ASCII copy, so they are only valid in
    // $text if every character was transliterated to exactly one character
    for ($x = 0; $x < $nb; $x++)
    {
        $offset_debut = $tab_offset_debut[$x];
        $offset_fin = $tab_offset_fin[$x];
        // Add the tag after and THEN before, to preserve offset values
        $text = mb_substr($text, 0, $offset_fin, 'UTF-8').$end.mb_substr($text, $offset_fin, null, 'UTF-8');
        $text = mb_substr($text, 0, $offset_debut, 'UTF-8').$start.mb_substr($text, $offset_debut, null, 'UTF-8');
    }

    // echo "<hr>".$text;   // debug output
    return $text;   // Return text with highlights
}

Parameters:

tab_mot is an array of the words I used for the query.
text is the sentence matching the query.
start is the tag I want to insert before the word to highlight.
end is the tag inserted at the end of the word.

So tab_mot can contain "formacao" while text contains "formação".

I think there are enough comments to understand it. Notice the use of mb_substr rather than substr (there is no mb_substr_replace).

Note: just a detail. For iconv to work correctly, don't forget to set the locale using $ret = setlocale(LC_ALL, "pt_BR.utf-8"); // Brazilian Portuguese in my case
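The output of //TRANSLIT depends on the platform and the locale: some iconv implementations emit "?" or more than one character for an accented letter. The offsets found in the ASCII copy only line up with the original text when every character is transliterated to exactly one character. A small guard can detect when that assumption fails; this is a sketch, and the helper name is_one_to_one is made up for the example.

```php
<?php
// Hypothetical guard: true when $ascii has exactly one byte for every
// character of the UTF-8 string $original, so offsets in both line up.
function is_one_to_one(string $original, $ascii): bool
{
    return is_string($ascii)
        && strlen($ascii) === mb_strlen($original, 'UTF-8');
}

$text = 'sobre formação, os cursos são rápidos';

// setlocale returns false when the locale is not installed.
setlocale(LC_ALL, 'pt_BR.utf-8');

// iconv returns false on failure; @ silences the accompanying notice.
$ascii = @iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);

if (is_one_to_one($text, $ascii)) {
    echo "Offsets from the ASCII copy can be reused\n";
} else {
    echo "Transliteration changed the length; offsets would be shifted\n";
}
```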

score -1 · Answer 2 · answered Feb 12 '14 at 06:09

This happens because the database transliterates while searching, i.e. if you search for 'á' you get matches for both 'á' and 'a' (translit). Your application code needs to do the same transliteration when highlighting the text. Use iconv for this: http://in2.php.net/manual/en/function.iconv.php
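As an alternative to iconv, and not what this answer proposes, the same folding can be sketched on the PHP side by expanding each letter of the search word into a character class of its accented variants. The constant ACCENT_VARIANTS and both function names are hypothetical, and the map is deliberately small; it should mirror whatever the database collation treats as equal.

```php
<?php
// Hypothetical map: base letter => accented variants (not exhaustive).
const ACCENT_VARIANTS = [
    'a' => 'áàâãä',
    'c' => 'ç',
    'e' => 'éèêë',
    'i' => 'íìîï',
    'o' => 'óòôõö',
    'u' => 'úùûü',
];

// Turns "wesha" or "weshá" into a pattern such as w[eéèêë]sh[aáàâãä].
function accent_insensitive_pattern(string $word): string
{
    $pattern = '';
    // Split the lowercased UTF-8 word into single characters.
    $chars = preg_split('//u', mb_strtolower($word, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($chars as $char) {
        $class = null;
        foreach (ACCENT_VARIANTS as $base => $variants) {
            if ($char === $base || mb_strpos($variants, $char, 0, 'UTF-8') !== false) {
                $class = '[' . $base . $variants . ']';
                break;
            }
        }
        // Letters without variants are matched literally.
        $pattern .= $class ?? preg_quote($char, '/');
    }
    return $pattern;
}

function highlight_accent_insensitive(string $word, string $text): string
{
    $pattern = '/' . accent_insensitive_pattern($word) . '/iu';
    // $0 keeps the matched text as written, accents and case included.
    return preg_replace($pattern, '<strong>$0</strong>', $text);
}

echo highlight_accent_insensitive('weshá', 'Wesha y weshá'), "\n";
```

Searching for either "wesha" or "weshá" then highlights both spellings, and the original capitalization is preserved because the replacement reuses the matched text.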

score -1 · Answer 3 · answered Feb 20 '17 at 14:59

Here is a PHP class that will highlight all occurrences of a search term in some HTML text by exploiting the PHP class Transliterator, which is available since PHP 5.4 and with the intl extension installed.

This class will transliterate each character in the HTML and then do a character-wise comparison of the search term and the transliterated HTML. It will highlight the matching terms using an HTML span element with the provided $css_class.

This class also supports characters whose transliteration yields more than one character, e.g. the Japanese character 手 transliterates to shou, so the characters 手ld will be highlighted in the text if the search term is should.

The class is only limited by the capabilities of PHP's Transliterator class implementation.

//------------------------------------------------------------------------------------------
// highlights all occurrences of an ascii $term_to_highlight in some
// $html string that may contain all sorts of weird characters
class SearchResultHighlighter {
//------------------------------------------------------------------------------------------
    public $term_to_highlight;
    protected $term_len;
    protected static $transliterator = null;

    //------------------------------------------------------------------------------------------
    public function __construct(
        $term_to_highlight,     // must be an already transliterated search term (ASCII only)
        $transliterator_rules   // rules passed to Transliterator::createFromRules
    ) {
        $this->term_to_highlight = $term_to_highlight;
        $this->term_len = mb_strlen($this->term_to_highlight);
        if(self::$transliterator === null) // Transliterator only available PHP >= 5.4.0, PECL intl >= 2.0.0
            self::$transliterator = class_exists('Transliterator') ? Transliterator::createFromRules($transliterator_rules) : null;
    }

    //------------------------------------------------------------------------------------------
    public function highlight(
        $html,              // the HTML in which to highlight all occurrences of $this->term_to_highlight
        $css_class = 'hl'   // the CSS class used to highlight occurrences
    ) {
        if(self::$transliterator === null)
            return $html;
        $result = '';
        $source_len = mb_strlen($html);
        $matched_term_chars = 0;
        $source_match_startpos = 0;
        $source_match_len = 0;
        for($i = 0; $i < $source_len; $i++) {
            $c = mb_substr($html, $i, 1);
            $c_trans = mb_strtolower(self::$transliterator->transliterate($c));
            $c_trans_len = mb_strlen($c_trans); // note: single transliterated chars can be more than one char, e.g. transliterate('手') yields 'shou'
            if($c_trans_len <= $this->term_len - $matched_term_chars && $c_trans === mb_substr($this->term_to_highlight, $matched_term_chars, $c_trans_len))    {
                if($matched_term_chars == 0)
                    $source_match_startpos = $i;
                $matched_term_chars += $c_trans_len;
                $source_match_len++;
                if($matched_term_chars == $this->term_len) {
                    $result .= sprintf('<span class="%s">%s</span>', $css_class, mb_substr($html, $source_match_startpos, $source_match_len));
                    $matched_term_chars = $source_match_len = 0;
                }
            }
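            else {
                if($source_match_len > 0) {
                    // partial match failed: emit its first character and rescan from the next one,
                    // so that a match starting inside the failed partial match is not missed
                    $result .= mb_substr($html, $source_match_startpos, 1);
                    $i = $source_match_startpos;
                }
                else
                    $result .= $c;
                $matched_term_chars = $source_match_startpos = $source_match_len = 0;
            }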
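        }
        // flush a partial match that was still pending at the end of the text
        if($source_match_len > 0)
            $result .= mb_substr($html, $source_match_startpos, $source_match_len);
        return $result;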
    }
}

For instance, you can use this as follows.

$html = '<p>ŁoreM Ìpsum Ðolór. Šit Ämet. Some really long, accénted and diactritical stuff, e.g. the names Ḥasan or Abū ʿĀṣī come with some diacritics. James Bond loves Ms. Pussy Galore!</p>';

$transliteration_rules = ':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;';

$highlighter = new SearchResultHighlighter('lore', $transliteration_rules);

echo $highlighter->highlight($html, 'yellow-bold');

(Note: for an explanation of the transliteration rules, refer to the PHP documentation of the Transliterator::createFromRules method.) This will produce:

<p><span class="yellow-bold">Łore</span>M Ìpsum Ðolór. Šit Ämet. Some really long, accénted and diactritical stuff, e.g. the names Ḥasan or Abū ʿĀṣī come with some diacritics. James Bond loves Ms. Pussy Ga<span class="yellow-bold">lore</span>!</p>

Of course, in your CSS you should have something like

span.yellow-bold {
  background-color: yellow;
  font-weight: bold;
}
