I want to quickly check whether a UTF-8 word exists as an array key.
The words may have:
- different case
- accented characters or not
- different Unicode normalization forms
I can use mb_strtolower() to make them both lowercase, and Normalizer::normalize() to normalize the strings. This covers the first and third bullet points (case and normalization form), but does not handle accents:
'tést' !== 'test'
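To make the limitation concrete, here is a minimal sketch of that lowercase-then-normalize step. The helper name lowerAndNormalize() is hypothetical, the choice of NFC is an assumption (any single form applied consistently would do), and the mbstring and intl extensions are assumed to be loaded:

```php
<?php
// Hypothetical helper name; requires the mbstring and intl extensions.
function lowerAndNormalize(string $word): string
{
    // Fold case first, then bring the result to NFC so that composed and
    // decomposed spellings of the same character become byte-identical.
    return Normalizer::normalize(mb_strtolower($word, 'UTF-8'), Normalizer::FORM_C);
}

$composed   = "t\u{00E9}st";   // 'é' as a single code point
$decomposed = "te\u{0301}st";  // 'e' followed by a combining acute accent

var_dump(lowerAndNormalize('TEST') === lowerAndNormalize('test'));          // bool(true): case is handled
var_dump(lowerAndNormalize($composed) === lowerAndNormalize($decomposed));  // bool(true): normalization form is handled
var_dump(lowerAndNormalize($composed) === lowerAndNormalize('test'));       // bool(false): the accent still differs
```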
I can use Collator to compare both words:
$collator = new Collator('fr_FR');
$collator->setStrength(Collator::PRIMARY);
$collator->compare('tést', 'test'); // 0
This covers all 3 bullet points, but now I have to loop over all my word pairs to compare them, when I want to be able to perform a direct lookup by array key (I have many lookups to perform on a big dictionary).
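For illustration, this is roughly what the Collator-only approach forces me into. The function name existsByLinearScan() and the sample word list are hypothetical; the point is only that every lookup has to walk the whole dictionary:

```php
<?php
// Hypothetical helper: scans the whole word list on every lookup (intl extension required).
function existsByLinearScan(Collator $collator, array $dictionaryWords, string $lookupWord): bool
{
    foreach ($dictionaryWords as $dictionaryWord) {
        // compare() returns 0 when the two strings are equal at the collator's strength
        if ($collator->compare($dictionaryWord, $lookupWord) === 0) {
            return true;
        }
    }
    return false;
}

$collator = new Collator('fr_FR');
$collator->setStrength(Collator::PRIMARY);

$dictionaryWords = ['tést', 'arbre']; // sample data for illustration only

var_dump(existsByLinearScan($collator, $dictionaryWords, 'test'));
var_dump(existsByLinearScan($collator, $dictionaryWords, 'maison'));
```

Each call costs one compare() per dictionary word, which is what I am trying to avoid.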
What I want is:
function reduce($word) {
    // how?
}

// prepare the dictionary (once)
$dictionary = [];
foreach ($dictionaryWords as $dictionaryWord) {
    $dictionary[reduce($dictionaryWord)] = true;
}

// perform a lookup (many times)
if (isset($dictionary[reduce($lookupWord)])) {
    // it's a match!
}
Basically, I want the reduce() function (which may be poorly named) to perform a simplification like this one:
- 'TÈST' => 'test'
- 'Straße' => 'strasse'
I believe MySQL does something like this internally for its text indexes.
Is there an intl function that does this? The list of intl classes and functions is hard to digest.