2

I have the following query in MySQL:

SELECT id FROM unicode WHERE `character` = 'a'

The table unicode contains each unicode character along with an ID (it's integer encoding value). Since the collation of the table is set to utf8_unicode_ci, I would have expected the above query to only return 97 (the letter 'a'). Instead, it returns 119 rows containing the IDs of many 'a'-like letters:

a A Ã ...

It seems to be ignoring both case and the multi-byte nature of the characters.

Any ideas?

David Jones
  • 10,117
  • 28
  • 91
  • 139
  • Out of curiosity, can't you just use [`ORD()`](http://dev.mysql.com/doc/en/string-functions.html#function_ord) instead of maintaining this table? For example, `SELECT ORD('a');` – eggyal Sep 14 '12 at 21:28
  • Well, I'm actually storing more than just the Unicode point in the database, but I left that out to simplify the question. Also, `ORD` is not Unicode compatible, though I did find a good alternative: http://stackoverflow.com/questions/11304582/searching-for-a-good-unicode-compatible-alternative-to-the-php-ord-function. – David Jones Sep 14 '12 at 21:59
  • Ah, nevemind - my mind was in PHP land. I'm not sure if the MySQL `ORD` is Unicode compatible or not... – David Jones Sep 14 '12 at 22:02
  • Also, it's not hard to maintain the Unicode table. I have a script that automatically scrapes http://www.unicode.org/Public/UNIDATA/UnicodeData.txt from time to time. – David Jones Sep 14 '12 at 22:04
  • Its manual entry suggests that one should be able to get the Unicode code point from `ORD(CONVERT('a' USING utf16))`. – eggyal Sep 14 '12 at 22:04
  • Huh - good to know. I may just use that... – David Jones Sep 14 '12 at 22:05

3 Answers3

2

The collation of the table is part of the issue; MySQL with a _ci collation is treating all of those 'a's as variants of the same character.

Switching to a _cs collation will force the engine to distinguish 'a' from 'A', and 'á' from 'Á', but it may still treat 'a' and 'á' as the same character.

If you need exact comparison semantics, completely disregarding the equivalency of similar characters, you can use the BINARY comparison operators

SELECT id FROM unicode WHERE BINARY character = 'a'
Ian Clelland
  • 43,011
  • 8
  • 86
  • 87
  • By default, MySQL doesn't ship with `*_cs` Unicode collations; also, `BINARY` will force a byte-by-byte match which may cause problems if a matched character has more than one encoding (which can happen in Unicode, especially with decorators). A Unicode binary collation `*_bin` is much more appropriate in this case. – eggyal Sep 14 '12 at 21:16
  • Thanks everyone. I changed the default collation of my database and every table to utf8_bin, but the problem was not resolved, but adding the `BINARY` bit did the trick. Any ideas why? – David Jones Sep 14 '12 at 21:18
  • Ah, that did the trick. I was using MySQL Workbench and I thought that changing the collation of a table cascaded down to the columns - apparently not! – David Jones Sep 14 '12 at 21:21
  • No, changing the table collation doesn't change the columns; that's a separate step – Ian Clelland Sep 14 '12 at 21:22
  • @eggyal -- Every reference I can find suggests that utf8_bin also does byte-by-byte comparisons. Are you saying that it forces data into one of the normalized forms and uses that form for collation? – Ian Clelland Sep 14 '12 at 21:32
  • The biggest problem with `BINARY` is that, if the operands being compared are in different character sets, identical strings will have different encodings and will therefore fail the comparison. Performing the comparison under a Unicode binary collation ensures that both strings are first converted to a Unicode encoding. Even then there are corner cases with straight byte-by-byte comparisons, e.g. with surrogate pairs in UTF-16. I was under the impression that the Unicode `*_bin` collations furthermore equated decorator combinations with their decorated alternatives, but I may well be wrong. – eggyal Sep 14 '12 at 21:38
2

As documented under Unicode Character Sets:

MySQL implements the xxx_unicode_ci collations according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.

The full collation chart makes clear that, in this collation, most variations of a base letter are equivalent irrespective of their lettercase or accent/decoration.

If you want to only match exact letters, you should use a binary collation such as utf8_bin.

eggyal
  • 122,705
  • 18
  • 212
  • 237
  • As mentioned in my other comments, changing the collation to utf8_bin did not resolve my problem. I'm not sure why... – David Jones Sep 14 '12 at 21:20
  • @David: Did you change the collation of the `character` column (which is really the only one that matters)? You could also explicitly specify a collation to use within your query: ``WHERE `character` = 'a' COLLATE utf8_bin``. – eggyal Sep 14 '12 at 21:20
1

The ci in the collation means case-insensitive. Switch to a case-sensitive collation (cs) to get the results you're looking for.

G-Nugget
  • 8,666
  • 1
  • 24
  • 31
  • By default, MySQL doesn't ship with `*_cs` Unicode collations. A Unicode binary collation `*_bin` is much more appropriate in this case. – eggyal Sep 14 '12 at 21:17
  • Nevermind - I didn't actually change the collation of the 'character' column to utf8_bin. When I did, it worked great. – David Jones Sep 14 '12 at 21:23
  • I agree, utf8_bin would actually be what you want for this case. – G-Nugget Sep 14 '12 at 22:22