Reduce a UTF-8 string for binary comparison

Question

I want to quickly check if a UTF-8 word exists as an array key.

The words may have:

different case
accented characters or not
different Unicode normalization forms

I can use mb_strtolower() to make them both lowercase, and Normalizer::normalize() to normalize the strings. This covers the first and third bullet points (case and normalization form), but does not handle accents:

'tést' !== 'test'
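
To make that limitation concrete, here is a minimal sketch of the case-folding plus normalization step. It assumes the intl and mbstring extensions are loaded; the helper name normalizeCaseAndForm() is made up for this illustration.

```php
<?php
// Hypothetical helper: lowercases and normalizes to NFC, nothing more.
function normalizeCaseAndForm(string $word): string
{
    $normalized = Normalizer::normalize(mb_strtolower($word, 'UTF-8'), Normalizer::FORM_C);

    // Normalizer::normalize() returns false on invalid input; fall back to the raw word.
    return $normalized === false ? $word : $normalized;
}

$composed   = "t\u{00E9}st";  // "é" as a single code point
$decomposed = "te\u{0301}st"; // "e" followed by a combining acute accent

// Different normalization forms now compare equal.
var_dump(normalizeCaseAndForm($composed) === normalizeCaseAndForm($decomposed)); // bool(true)

// Different case compares equal too.
var_dump(normalizeCaseAndForm('TÉST') === normalizeCaseAndForm('tést')); // bool(true)

// The accent is still there, so the unaccented word does not match.
var_dump(normalizeCaseAndForm('tést') === 'test'); // bool(false)
```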

I can use Collator to compare both words:

$collator = new Collator('fr_FR');
$collator->setStrength(Collator::PRIMARY);
$collator->compare('tést', 'test'); // 0

This checks my 3 bullet points, but now I have to loop over all my word pairs to compare them, when I want to be able to perform a binary lookup as an array key (I have many lookups to perform on a big dictionary).

What I want is:

function reduce($word) {
    // how?
}

// prepare the dictionary (once)

$dictionary = [];

foreach ($dictionaryWords as $dictionaryWord) {
    $dictionary[reduce($dictionaryWord)] = true;
}

// perform a lookup (many times)

if (isset($dictionary[reduce($lookupWord)])) {
    // it's a match!
}

Basically, I want the reduce() function (which may be poorly named) to perform a simplification like this one:

'TÈST' => 'test'
'Straße' => 'strasse'

I believe MySQL does something like this internally for its text indexes.

Is there an intl function that does this? The list of intl classes and functions is hard to digest.

https://stackoverflow.com/questions/1017599/how-do-i-remove-accents-from-characters-in-a-php-string might have what you need — Pete, Nov 14 '19 at 17:08
@Pete Most of the answers on this page are about hardcoding character maps, which is the poor man's solution to this problem. However, I can see deep down an answer about `Transliterator`, which may be what I'm looking for. I will test this and report the results. — BenMorel, Nov 14 '19 at 18:11

Ro Achterberg · Answer 1 · 2019-11-14T18:20:45.967

0

Reading your question, it seems that you're only really interested in checking if a word exists as a unique array index.

You could do this by cryptographically hashing the word and using the hash as the index. It would go something like this:

<?php
$word = 'TÈST';
$dictionary[sha1($word)] = TRUE;

Or use an algorithm that is more resilient to collision attacks, if that is a concern to you. Please elaborate on your question if you need any pointers in that area.

UPDATE

Please see the snippet below, which produces "test, strasse".

<?php

setlocale(LC_ALL, 'nl_NL.UTF-8');

$words = [ 'TÈST', 'Straße' ];

foreach ($words as $index => $word)
{
    echo ($index?', ':'') . strtolower(iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $word));
}

edited Nov 14 '19 at 18:20

answered Nov 14 '19 at 18:04

Ro Achterberg


It looks like you did not properly read, or understand, the question. Your solution would fail if I'm looking for `'test'`, as `sha1('test') !== sha1('TÈST')`. This question is about collations, not about cryptographic hashes. – BenMorel Nov 14 '19 at 18:07
Perhaps. Perhaps you could have phrased it better. Please check out my update. – Ro Achterberg Nov 14 '19 at 18:18

score 0 · Accepted Answer · answered Nov 14 '19 at 18:33

What I'm looking for is the Transliterator class. An example can be found in this answer:

$string = "Fóø Bår";
$transliterator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
echo $transliterator->transliterate($string); // foo bar

Thanks to @Pete for the pointer in the comments.

This even works with non-European characters:

echo $transliterator->transliterate('Fóø Bår 学中文'); // foo bar xue zhong wen

Where iconv would fail at the job:

echo iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', 'Fóø Bår 学中文'); // Foo Bar ???

Unless I'm missing some other iconv options, of course.
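
For completeness, here is a rough sketch of how this could be plugged into the reduce() function from the question. It reuses the rule string above and assumes the intl extension is available; the sample words, the static caching of the transliterator, and the error handling are my own additions for illustration.

```php
<?php
// Sketch only: reduces a word to a lowercase, accent-free ASCII form
// so that it can be used as a plain array key.
function reduce(string $word): string
{
    // Build the transliterator once and reuse it across calls.
    static $transliterator = null;

    if ($transliterator === null) {
        $transliterator = Transliterator::createFromRules(
            ':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;',
            Transliterator::FORWARD
        );

        if ($transliterator === null) {
            throw new RuntimeException('Could not create the transliterator.');
        }
    }

    $reduced = $transliterator->transliterate($word);

    if ($reduced === false) {
        throw new RuntimeException('Could not transliterate the word.');
    }

    return $reduced;
}

// Prepare the dictionary (once). Sample words, for illustration only.
$dictionaryWords = ['TÈST', 'Straße'];
$dictionary = [];

foreach ($dictionaryWords as $dictionaryWord) {
    $dictionary[reduce($dictionaryWord)] = true;
}

// Perform lookups (many times).
var_dump(isset($dictionary[reduce('tést')]));    // bool(true)
var_dump(isset($dictionary[reduce('STRASSE')])); // bool(true)
var_dump(isset($dictionary[reduce('other')]));   // bool(false)
```

Keep in mind that this is lossy on purpose: distinct words that only differ by accents or case end up under the same key, which is what the question asks for but may not suit every dictionary.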
