How to query a MySQL UTF-8 table using ASCII

Question

I have created a MySQL 5.6 table with a column encoded in UTF-8, holding text in Romanian, Czech, Hungarian, Polish, French, German and the Scandinavian languages, i.e. European characters, many of them non-ASCII.

However, I would like to query this column using just ASCII characters, e.g. in a LIKE clause, so that characters such as ă, î, â, ș, ț, ü, ä, ö can be successfully matched using a, e, i, o, u, s, t, etc.

Is that even possible?

score 0 · Answer 1 · answered May 12 '15 at 08:18

I don't see a conventional way to do this using only SQL. You could write a query preprocessor that automatically replaces ASCII characters with their European counterparts, for example with https://php.net/manual/en/function.str-replace.php, assuming you are using PHP. You would still need to feed every query through it.
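A minimal sketch of what such a preprocessor might look like, written in Python rather than PHP purely for illustration. The `VARIANTS` table and the `expand_term` function are hypothetical names, and the table is far from complete. Note that it produces a regular expression (one alternation per letter), not a LIKE pattern, so it would have to be used with something like MySQL's REGEXP operator. MySQL 5.6 documents REGEXP as working byte-wise and not being multibyte safe, so the database side should be treated as untested.

```python
import re

# Hypothetical, incomplete table: ASCII letter -> accented forms to accept.
VARIANTS = {
    "a": "ăâäáà",
    "e": "éèêëęě",
    "i": "îíï",
    "l": "ł",
    "n": "ńň",
    "o": "öóôø",
    "s": "șşšś",
    "t": "țţť",
    "u": "üúůű",
}

def expand_term(term: str) -> str:
    """Turn an ASCII search term into a regex with one alternation per letter."""
    parts = []
    for ch in term.lower():
        extra = VARIANTS.get(ch, "")
        if extra:
            # e.g. "l" becomes "(l|ł)"
            parts.append("(" + "|".join([ch, *extra]) + ")")
        else:
            # Escape anything that could be a regex metacharacter.
            parts.append(re.escape(ch))
    return "".join(parts)

pattern = expand_term("woloszynska")
# Checked here with Python's own regex engine, not with MySQL.
print(bool(re.fullmatch(pattern, "wołoszyńska")))
```

The expanded pattern should still be passed to the database as a bound parameter, never concatenated into the SQL string.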

score 0 · Answer 2 · edited May 23 '17 at 12:06

I found a partial answer to my question:

If the collation you define for the column is utf8_general_ci, then many (if not all) variants of a, e, o, u will be found by a query using plain a, e, o, u. I even found the ń in Wołoszyńska using plain n.

Unfortunately, the lowercase l with stroke (ł) in the same word was not found.
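One likely reason for the difference: in Unicode, ń decomposes into n plus a combining accent, while ł has no decomposition at all, so accent stripping alone never turns it into l. The sketch below illustrates this with Python's `unicodedata` module. It is not how MySQL implements its collations, and the `NO_DECOMPOSITION` table and both function names are hypothetical and incomplete. A folded copy like this could be stored in a separate search column, which is an assumption about the schema, not something the table above already has.

```python
import unicodedata

# Hypothetical, incomplete table of letters that have no Unicode decomposition.
NO_DECOMPOSITION = str.maketrans({
    "ł": "l", "Ł": "L",
    "ø": "o", "Ø": "O",
    "đ": "d", "Đ": "D",
})

def strip_marks(text: str) -> str:
    """Decompose to NFD and drop the combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def fold_ascii(text: str) -> str:
    """Map the stroke letters explicitly, then strip the remaining accents."""
    return strip_marks(text.translate(NO_DECOMPOSITION))

print(strip_marks("Wołoszyńska"))  # the ń loses its accent, the ł survives
print(fold_ascii("Wołoszyńska"))   # both are folded
```

Here `strip_marks("Wołoszyńska")` gives `Wołoszynska`, while `fold_ascii("Wołoszyńska")` gives `Woloszynska`. Letters such as ß or æ are not handled by this sketch.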

The answer was suggested by dddd's answer here

answered May 12 '15 at 08:47 · Mikey

Have you tried `utf8_unicode_ci` as well? It contains additional mappings (though not sure if it specifically covers your L). – deceze May 12 '15 at 09:31
@deceze Unfortunately, it doesn't work. I've noticed that "important sites" refrain from using it and instead use the plain L. – Mikey May 12 '15 at 14:10

score 0 · Accepted Answer · answered May 12 '15 at 22:51

There is a cheat sheet showing which letters compare as "equal" under which collation: utf8 collations. It agrees that Ł is not mapped to L for any collation. utf8_general_ci sorts it after Z; utf8_unicode_520_ci sorts it with L; the rest sort it before M.

polish_ci treats Ę as distinct from the rest of the E-like characters. Ditto for Ą. The Baltic states tend to keep certain accented consonants separate.

In polish_ci, ń (UTF-8 hex C584) collates after N and before O; the other collations treat it as equal to N.

utf8_unicode_520_ci is probably the best collation for you.

Also, you might consider "combining" accents, where two utf8 'characters' "combine" to make a single character. utf8_unicode_ci collates 'correctly' for most of them, as seen here.
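To make the combining case concrete, here is a small Python sketch showing the same visible letter stored two ways, and one way to normalize text to the precomposed (NFC) form before it is inserted. The `to_nfc` helper is a hypothetical name, and normalizing on the application side is an assumption about your setup, not something MySQL does for you.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
combining = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

def to_nfc(text: str) -> str:
    """Normalize to NFC so accented letters use their precomposed form."""
    return unicodedata.normalize("NFC", text)

# The two strings render the same but are different sequences.
print(precomposed == combining)
print(len(precomposed.encode("utf-8")), len(combining.encode("utf-8")))

# After normalization they are identical.
print(to_nfc(combining) == precomposed)
```

The precomposed form is 2 bytes in UTF-8 and the combining form is 3, so a collation that compares code points one by one can treat them as different strings unless the text is normalized first.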
