10

I'm working on a way to search for specific words in a text and highlight them. The code works perfectly, except that I would also like it to match similar letters. I mean, searching for fête should also match fêté, fete, ...

Is there an easy & elegant way to do this?

This is my current code:

$regex='/(' . preg_replace('/\s+/', '|', preg_quote($usersearchstring)) .')/iu';

$higlightedtext = preg_replace($regex, '<span class="marked-search-text">\0</span>', $text);

My text is not HTML-encoded, and searching in MariaDB already matches the similar results.

[edit] And here is a longer example of the issue:

$usersearchstring='fête';
$text='la paix fêtée avec plus de 40 cultures';
$regex='/(' . preg_replace('/\s+/', '|', preg_quote($usersearchstring)) .')/iu';
$higlightedtext = preg_replace($regex, '<span class="marked-search-text">\0</span>', $text);

The result is that $higlightedtext is identical to $text.

When changing $usersearchstring to the word "fêté", $higlightedtext is

'la paix <span class="marked-search-text">fêté</span>e avec plus de 40 cultures'

However, I want it to always match all the variations of letters, since there can be (and in reality are) many variations of the words. We have fête, fêté and possibly even fete in the database.

I have been thinking about this, but the only solution I see is to have a huge array with all letter replacement options, then loop over them and try every variation. But that is not elegant and will be slow, since for many letters I have at least 5 variations (aáàâä): if the word has 3 vowels, I would need to run preg_replace 125 times (5x5x5).

[/edit]

O. Jones
Kamiware

4 Answers

7

Your question is about collation, the art of handling natural-language text to order and compare it using knowledge about languages' lexical rules. You're looking for case-insensitive and diacritical-mark-insensitive collation.

A common collation rule is B comes after A. A less common rule, but important to your question, is ê and e are equivalent. Collations contain lots of rules like these, worked out carefully over years. If you're using case-insensitive collation, you want rules like a and A are equivalent.

A diacritical rule that's true in most European languages, but not Spanish, is this: Ñ and N are equivalent. In Spanish, Ñ comes after N.

Modern databases know about these collations. If you use MySQL, for example, you can set up a column with a character set of utf8mb4 and a collation of utf8mb4_unicode_ci. This does a good job with most languages (though it is not perfect for Spanish).

Regex technology is not very useful for collation work. If you use regex for this you're trying to reinvent the wheel, and you're likely to reinvent the flat tire instead.

PHP, like most modern programming languages, includes collation support, through its Collator class (part of the intl extension). Here's a simple example of the use of a Collator object for your accented-character use case. It uses the Collator::PRIMARY collation strength to perform the case- and accent-insensitive comparison.

mb_internal_encoding("UTF-8");
$collator  = collator_create('fr_FR');
$collator->setStrength(Collator::PRIMARY);
$str1 = mb_convert_encoding('fêté', 'UTF-8');
$str2 = mb_convert_encoding('fete', 'UTF-8');
$result = $collator->compare($str1, $str2);
echo $result;

The $result here is zero, meaning the strings are equal. That's what you want.
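To see what the strength setting changes, here is a minimal sketch comparing the same pairs at two strengths. It assumes the intl extension is loaded; the variable names are only for illustration.

```php
<?php
// Sketch only: assumes the intl extension (Collator) is available.
$primary = new Collator('fr_FR');
$primary->setStrength(Collator::PRIMARY);   // base letters only

$tertiary = new Collator('fr_FR');
$tertiary->setStrength(Collator::TERTIARY); // accents and case count too

// PRIMARY ignores accents and case, so both pairs compare as equal (0).
var_dump($primary->compare('fêté', 'fete') === 0);
var_dump($primary->compare('Fete', 'fete') === 0);

// TERTIARY treats the same pairs as different (non-zero result).
var_dump($tertiary->compare('fêté', 'fete') !== 0);
var_dump($tertiary->compare('Fete', 'fete') !== 0);
```

All four lines should print bool(true).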

If you want to search for matching substrings within a string this way you need to do so with explicit substring matching. Regex technology doesn't provide that.

Here's a function to do the search and annotation (adding of <span> tags, for example). It takes full advantage of the Collator class's schemes for character equality.

function annotate_ci ($haystack, $needle, $prefix, $suffix, $locale="fr_FR") {

    $restoreEncoding = mb_internal_encoding();
    mb_internal_encoding("UTF-8");
    $len = mb_strlen($needle);
    if ( $len === 0 || mb_strlen( $haystack ) < $len ) {
        mb_internal_encoding($restoreEncoding);
        return $haystack;
    }
    $collator = collator_create( $locale );
    $collator->setStrength( Collator::PRIMARY );

    $result = "";
    $remain = $haystack;
    while ( mb_strlen( $remain ) >= $len ) {
        $matchStr = mb_substr($remain, 0, $len);
        $match = $collator->compare( $needle, $matchStr );
        if ( $match == 0 ) {
            /* add the matched $needle string to the result, with annotations.
             * take the matched string from $remain
             */
            $result .= $prefix . $matchStr . $suffix;
            $remain = mb_substr( $remain, $len );
        } else {
            /* add one char to $result, take one from $remain */
            $result .= mb_substr( $remain, 0, 1 );
            $remain = mb_substr( $remain, 1 );
        }
    }
    $result .= $remain;
    mb_internal_encoding($restoreEncoding);
    return $result;
}

And here's an example of the use of that function.

$needle = 'Fete';  /* no diacriticals here! mixed case! */
$haystack= mb_convert_encoding('la paix fêtée avec plus de 40 cultures', 'UTF-8');

$result = annotate_ci($haystack, $needle, 
                      '<span class="marked-search-text">' , '</span>');

It gives back

 la paix <span class="marked-search-text">fêté</span>e avec plus de 40 cultures
O. Jones
4

A simple approach is to convert the input text to Unicode Normalization Form D, which performs a Canonical Decomposition, splitting accented characters into a base character followed by combining marks. Sequences of base characters and marks can then be matched easily using PCRE's Unicode features: combining marks can be matched with \p{M}. Afterwards, convert the text back to NFC. Example for fetee:

$string = "la paix fêtée avec plus de 40 cultures";

$nfd = Normalizer::normalize($string, Normalizer::FORM_D);
$highlighted = preg_replace('/f\p{M}*e\p{M}*t\p{M}*e\p{M}*e\p{M}*/iu',
                            '<b>\0</b>', $nfd);
$nfc = Normalizer::normalize($highlighted, Normalizer::FORM_C);

print $nfc;

Generating the regex for search strings is straightforward. Decompose the search string, remove all combining marks, and insert \p{M}* after each character.

$string = "la paix fêtée avec plus de 40 cultures";
$keyword = "fêtée";

# Create regex.
$nfd = Normalizer::normalize($keyword, Normalizer::FORM_D);
$regex = preg_replace_callback('/(.)\p{M}*/su', function ($match) {
    return preg_quote($match[1], '/') . '\p{M}*';
}, $nfd);

# Highlight.
$nfd = Normalizer::normalize($string, Normalizer::FORM_D);
$highlighted = preg_replace('/' . $regex . '/iu', '<b>\0</b>', $nfd);
$nfc = Normalizer::normalize($highlighted, Normalizer::FORM_C);

This solution doesn't rely on hardcoded character tables and works with accented Latin characters beyond ISO-8859-1 which are often used in Eastern European languages. It even works with non-Latin scripts, for example Greek diacritics.
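The two steps can be wrapped into one function that also handles several whitespace-separated search words, as in the question. This is only a sketch: highlight_ignoring_marks() and its parameters are names I made up, it assumes the intl extension is loaded and that the input is valid UTF-8.

```php
<?php
// Sketch only: hypothetical helper, assumes ext-intl (Normalizer) and valid UTF-8 input.
function highlight_ignoring_marks(string $text, string $search,
                                  string $open = '<b>', string $close = '</b>'): string
{
    // Split the search string into words, dropping empty pieces.
    $words = preg_split('/\s+/u', $search, -1, PREG_SPLIT_NO_EMPTY);
    if (!$words) {
        return $text;
    }

    $alternatives = [];
    foreach ($words as $word) {
        // Decompose, drop the combining marks, allow any marks after each character.
        $nfd = Normalizer::normalize($word, Normalizer::FORM_D);
        $alternatives[] = preg_replace_callback(
            '/(.)\p{M}*/su',
            fn($m) => preg_quote($m[1], '/') . '\p{M}*',
            $nfd
        );
    }
    $regex = '/(?:' . implode('|', $alternatives) . ')/iu';

    // Match on the decomposed text, then recompose the highlighted result.
    $nfdText = Normalizer::normalize($text, Normalizer::FORM_D);
    $marked  = preg_replace_callback($regex, fn($m) => $open . $m[0] . $close, $nfdText);

    return Normalizer::normalize($marked, Normalizer::FORM_C);
}

echo highlight_ignoring_marks('la paix fêtée avec plus de 40 cultures', 'fete culture'), "\n";
// la paix <b>fêté</b>e avec plus de 40 <b>culture</b>s
```

A callback is used for the replacement so that the opening and closing tags are inserted literally, whatever characters they contain.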

nwellnhof
3

You can't reasonably do this only with RegExp. (you could, but it wouldn't be sane!)


Option 1: transliteration before search

What you should do is transliterate your needle and haystack strings into their ASCII equivalents, before testing them with a regex.

So 1) Temporarily convert your strings to ASCII and 2) Regex match.

Some people have already done work on the transliteration problem, which you could make use of: see https://github.com/nicolas-grekas/Patchwork-UTF8/blob/master/src/Patchwork/Utf8.php

Or, if you are only expecting French input, you could manually build a map of special characters and their ASCII equivalents. As far as I know, French would only need a few vowels and ç.

Once you have the replacements map ready, just run your strings through a function which replaces all the special chars with their ASCII equivs, then you can do your Regex search on the "plain" strings.

As for your performance concerns, I wouldn't fret. For each of:

à : a
â : a
è : e
é : e
ê : e
ë : e
î : i
ï : i
ô : o
ù : u
ü : u
û : u
ç : c

Run a replace on your needle and haystack strings.

After those 13 iterations you get your two plain ASCII strings to test.
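A minimal sketch of that idea, using strtr() so the 13 replacements happen in one pass. The constant FRENCH_MAP and the helper to_plain() are hypothetical names; the map only covers the lower-case letters listed above, so the input is lower-cased first.

```php
<?php
// Sketch only: the map mirrors the 13 pairs listed above; names are hypothetical.
const FRENCH_MAP = [
    'à' => 'a', 'â' => 'a',
    'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
    'î' => 'i', 'ï' => 'i',
    'ô' => 'o',
    'ù' => 'u', 'ü' => 'u', 'û' => 'u',
    'ç' => 'c',
];

function to_plain(string $s): string
{
    // Lower-case first so that É, Ê, ... hit the lower-case entries of the map.
    return strtr(mb_strtolower($s, 'UTF-8'), FRENCH_MAP);
}

$needle   = to_plain('Fête');
$haystack = to_plain('la paix fêtée avec plus de 40 cultures');

echo $needle, "\n";   // fete
echo $haystack, "\n"; // la paix fetee avec plus de 40 cultures

// Both strings are plain now, so an ordinary regex is enough.
var_dump(preg_match('/' . preg_quote($needle, '/') . '/', $haystack) === 1); // bool(true)
```

Note that the match is found in the plain copy. To highlight the original text you still have to map the match position back to it, which only works as long as every replacement keeps the character count unchanged.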


Option 2: native DB functions

And… if your data is in a database, you may not need to do anything but use what's already there: http://dev.mysql.com/doc/refman/5.7/en/charset.html


Option 3: dynamically generated search patterns

You can make a function that given:

  • a map of corresponding chars like the one above and
  • a word to find

generates a regular expression pattern which contains the matching character sets for each character that has valid alternates.

In this case if you search for féte your function would create a regex pattern like /(f[eéèêë]t[eéèêë])/iu which you could then use to find your text.

The only time-consuming part would be to create good character maps for all languages…
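Here is a rough sketch of such a function for a single search word. build_accent_pattern() and $groups are names I made up for illustration, and the groups cover only a few French letters; it needs mbstring and PHP 7.4 or later for mb_str_split().

```php
<?php
// Sketch only: hypothetical helper and a deliberately small set of groups.
function build_accent_pattern(string $word, array $groups): string
{
    // Map every character of a group to the character class for that group.
    $classFor = [];
    foreach ($groups as $group) {
        foreach (mb_str_split($group, 1, 'UTF-8') as $ch) {
            $classFor[$ch] = '[' . $group . ']';
        }
    }

    $pattern = '';
    foreach (mb_str_split(mb_strtolower($word, 'UTF-8'), 1, 'UTF-8') as $ch) {
        // Characters without alternates are quoted and matched literally.
        $pattern .= $classFor[$ch] ?? preg_quote($ch, '/');
    }
    return '/(' . $pattern . ')/iu';
}

$groups = ['aàâ', 'eéèêë', 'iîï', 'oô', 'uùûü', 'cç'];
$regex  = build_accent_pattern('féte', $groups);
echo $regex, "\n"; // /(f[eéèêë]t[eéèêë])/iu

echo preg_replace($regex, '<span class="marked-search-text">\0</span>',
                  'la paix fêtée avec plus de 40 cultures'), "\n";
// la paix <span class="marked-search-text">fêté</span>e avec plus de 40 cultures
```

Only the search word is converted here; the text is matched as it is, so the highlighted result keeps its original accents.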

tmslnz
  • Thanks @tmslnz for the info. I was looking for a solution in regex like /regex/i, where the i at the end means case-insensitive. Our website now has 38 languages, and most likely more will be added. This is also why I was not only thinking about the French characters. But our search always runs on only 1 language, so I could have different replacements for every language. (I don't understand most of them, though, so I cannot do anything useful there.) – Kamiware Nov 05 '16 at 12:54
  • That's only about case (`A` or `a`). _Special_ characters are a wholly different beast. There is no RegExp-only silver bullet for this. You need to _prepare_ your strings first, then regex match. – tmslnz Nov 05 '16 at 12:57
  • Have you looked at the library I linked to? I have not tested it, but it appears very well written. Also, if this content is inside a database I suggest you look into your DB's features for something that can search like you want it to. I suspect you may be barking up the wrong tree :) – tmslnz Nov 05 '16 at 13:00
  • Yes, sorry: my idea that à vs a is the same as A vs a comes from having changed the DB from utf8_bin to utf8_unicode_ci. MariaDB (MySQL) then automatically started to match fêté when I searched for fête (I'm happy with that), and I mistakenly thought that PHP could do that too. (So I'm happy with how the SQL part works.) – Kamiware Nov 05 '16 at 13:14
  • I looked a bit at the library, but it looks too complicated and too big for me. I'm not a web developer, only the sysadmin who complained about the website code (written by unpaid volunteers) and got the answer that I could fix it. (And I'm not a real PHP developer either.) The code is already not that fast, so I'd rather not add ballast, and I'd also rather not do something that breaks the function in some language I don't understand. -- I'm now wondering whether I can, depending on the language, replace e with [eéèê] and run it through the regex once like that. – Kamiware Nov 05 '16 at 13:18
  • I'm thinking you can also build your Regex dynamically. For `féte` you would build your `$searchstring` as `f[eéèêë]t[eéèêë]`. You still need to have a _map_ of corresponding chars, but would only need to build the regex pattern on the `$searchstring`, rather than convert both `$searchstring` and `$highlightedtext` – tmslnz Nov 05 '16 at 13:27
  • You *can* solve this easily with Canonical Decomposition and Unicode regexes. See my answer. – nwellnhof Nov 05 '16 at 15:10
1

There's unfortunately no magic character class or trick in PHP regex (that I know of) that could solve this out of the box. I've instead opted for another route:

$search = '+  fête   foret   ca rentrée w0w !!!';
$text = 'La paix fêtée avec plus de 40 cultures dans une forêt. Ça commence bien devant la rentrée...<br> Il répond: w0w tros cool!!! En + il fait chaud!';
$left_token = '<b>';
$right_token = '</b>';
$encoding = 'UTF-8';

// Let's normalize both the search string and the text
$search_normalized = normalize($search);
$text_normalized = normalize($text);

// Fixed preg_quote() and match UTF whitespaces
$search_needles = preg_split('/\s+/u', $search_normalized, -1, PREG_SPLIT_NO_EMPTY);

// We'll save the output in a separate variable
$text_output = $text;

// Since we made the tokens a variable, we'll need to calculate the offsets
$offset_size = strlen($left_token . $right_token);

// Start searching
foreach($search_needles as $needle) {
    // Reset for each word
    $search_offset = 0;

    // We may have several occurrences
    while(true) {
        if($search_offset > mb_strlen($text_normalized)) { // No more needles
            break;
        } else {
            $pos = mb_stripos($text_normalized, $needle, $search_offset, $encoding);
        }

        if($pos === false) { // No more needles here
            break;
        }
        $len = mb_strlen($needle);

        // Insert tokens
        $text_output = mb_substr($text_output, 0, $pos, $encoding) . // Left side
                       $left_token . 
                       mb_substr($text_output, $pos, $len, $encoding) . // The enclosed word
                       $right_token .
                       mb_substr($text_output, $pos + $len, NULL, $encoding); // Right side

        // We need to update this too otherwise the positions won't be the same
        $text_normalized = mb_substr($text_normalized, 0, $pos, $encoding) . // Left side
                       $left_token . 
                       mb_substr($text_normalized, $pos, $len, $encoding) . // The enclosed word
                       $right_token .
                       mb_substr($text_normalized, $pos + $len, NULL, $encoding); // Right side

        // Advance in the search
        $search_offset = $pos + $len + $offset_size;
    }
}

echo($text_output);
var_dump($text_output);

// Credits: http://stackoverflow.com/a/10064701
function normalize($input) {
    $normalizeChars = array(
        'Š'=>'S', 'š'=>'s', 'Ð'=>'Dj','Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A',
        'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E', 'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I',
        'Ï'=>'I', 'Ñ'=>'N', 'Ń'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U',
        'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss','à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a',
        'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i',
        'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ń'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u',
        'ú'=>'u', 'û'=>'u', 'ü'=>'u', 'ý'=>'y', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y', 'ƒ'=>'f',
        'ă'=>'a', 'î'=>'i', 'â'=>'a', 'ș'=>'s', 'ț'=>'t', 'Ă'=>'A', 'Î'=>'I', 'Â'=>'A', 'Ș'=>'S', 'Ț'=>'T',
    );
    return strtr($input, $normalizeChars);
}

Basically:

  1. Normalize: Convert needle and haystack to normal ASCII characters.
  2. Find position: Search for the position of the normalized needle in the normalized haystack.
  3. Insert: Insert the opening and closing tag accordingly into the original string.
  4. Repeat: Sometimes you may have several occurrences. This process is repeated until no occurrence is left.

Sample output:

La paix <b>fêté</b>e avec plus de 40 cultures dans une <b>forêt</b>. <b>Ça</b> commence bien devant la <b>rentrée</b>...<br> Il répond: <b>w0w</b> tros cool<b>!!!</b> En <b>+</b> il fait chaud!
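The insert step (step 3) relies on a position in the normalized text being the same position in the original text. That holds when every entry in the map replaces one character with one character; entries such as 'ß'=>'Ss' or 'Ð'=>'Dj' change the length, so positions after them shift. A small check, using a cut-down map assumed purely for illustration:

```php
<?php
// Sketch only: a cut-down map to illustrate the point, not the full table.
$map = ['ê' => 'e', 'é' => 'e', 'ß' => 'Ss'];

foreach (['fêtée', 'Straße'] as $word) {
    $plain = strtr($word, $map);
    printf("%s -> %s (%d vs %d characters)\n",
           $word, $plain,
           mb_strlen($word, 'UTF-8'), mb_strlen($plain, 'UTF-8'));
}
// fêtée -> fetee (5 vs 5 characters)
// Straße -> StraSse (6 vs 7 characters)
```

For text that contains such characters, either drop the multi-character entries from the map or track the offset difference when inserting the tags.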
HamZa