mySQL - matching latin (english) form input to utf8 (non-English) data

Question

I maintain a music database in mySQL, how do I return results stored under e.g. 'Tiësto' when people search for 'Tiesto'?

All the data is stored under full text indexing, if that makes any difference.

I'm already employing a combination of Levenshtein in PHP and REGEXP in SQL - not in trying to solve this problem, but just for increased searchability in general.

PHP:

function Levenshtein($word) {

$words = array();
for ($i = 0; $i < strlen($word); $i++) {
    $words[] = substr($word, 0, $i) . '_' . substr($word, $i);
    $words[] = substr($word, 0, $i) . substr($word, $i + 1);
    $words[] = substr($word, 0, $i) . '_' . substr($word, $i + 1);
    }
$words[] = $word . '_';
return $words;
}

$fuzzyartist = Levenshtein($_POST['searchartist']);
$searchimplode = "'".implode("', '", $fuzzyartist)."'";

mySQL:

SELECT *
FROM new_track_database
WHERE artist REGEXP concat_ws('|', $searchimplode);

To add, I frequently perform character set conversions and string sanitation in PHP, but these have always been the OTHER way - standardising non latin characters. I can't get my head around performing the oppsite process, but only in certain circumstances based on the data I've got stored.

possible duplicate of [MySQL REGEXP query - accent insensitive search](http://stackoverflow.com/questions/14137273/mysql-regexp-query-accent-insensitive-search) — Undo, Oct 03 '14 at 19:04
I am already using a REGEXP in my query; and I'm not skilled enough to know about combining multiple regexps, if that's indeed possible. Therefore, perhaps something useful can still come from this discussion. — user3174541, Oct 03 '14 at 19:11
Store data in MySQL as both the original, and as a sanitized, searchable version that only has latin characters? — wavemode, Oct 03 '14 at 19:38
that's an excellent idea, wavemode. it's no hassle to create a duplicate table for full text searching purposes only. thanks. — user3174541, Oct 03 '14 at 19:58
have you tried setting a collaction of utf8_unicode_ci and then during to run a regular query? — Henry, Jul 26 '17 at 06:11

score 2 · Answer 1 · answered Oct 03 '14 at 20:37

A possible solution would be creating another column in the database next to "artist", like "artist_normalized". Here, while populating the table, you could insert a "normalized" version of the string. Search can then be performed against the artist_normalized column.

A test code:

<?php
$transliterator = Transliterator::createFromRules(':: NFD; :: [:Nonspacing Mark:] Remove; :: NFC;', Transliterator::FORWARD);
$test = ['abcd', 'èe', '€', 'àòùìéëü', 'àòùìéëü', 'tiësto'];
foreach($test as $e) {
    $normalized = $transliterator->transliterate($e);
    echo $e. ' --> '.$normalized."\n";
}
?>

Result:

abcd --> abcd
èe --> ee
€ --> €
àòùìéëü --> aouieeu
àòùìéëü --> aouieeu
tiësto --> tiesto

The magic is done by the Transliterator class. The specified rule performs three actions: decomposes the string, removes diacritics and then recomposes the string, canonicalized. Transliterator in PHP is built on top of ICU, so by doing this you're relying on the tables of the ICU library, which are complete and reliable.

Note: this solution requires PHP 5.4 or greater with the intl extension.

Sadly I don't have 5.4, but this is still an ingenious solution! And reinforces the idea above. Fuzzy searching seems like a black hole; not a huge surprise given that this is really the foundation of Google's business model. Thanks e grazie for your time and effort. — user3174541, Oct 03 '14 at 23:51
If you don't have 5.4 or greater, then change host ;) it's 2014! @user3174541 — ItalyPaleAle, Oct 04 '14 at 03:37

mySQL - matching latin (english) form input to utf8 (non-English) data

1 Answers1

Linked