6

Desired result :

Have an accent sensitive primary key in MySQL.

I have a table of unique words, so I use the word itself as a primary key (by the way if someone can give me an advice about it, I have no idea if it's a good design/practice or not).

I need that field to be accent (and why not case) sensitive, because it must distinguish between, for instance, 'demandé' and 'demande', two different inflexions of the French verb "demander". I do not have any problem to store accented words in the database. I just can't insert two accented characters strings that are identical when unaccented.

Error :

When trying to create the 'demandé' row with the following query:

INSERT INTO `corpus`.`token` (`name_token`) VALUES ('demandé');

I got this error :

ERROR 1062: 1062: Duplicate entry 'demandé' for key 'PRIMARY'

Questions :

  • Where in the process should a make a modification in order to have two different unique primary keys for "demande" and "demandé" in that table ?

SOLUTION using 'collate utf8_general_ci' in table declaration

  • How can i make accent sensitive queries ? Is the following the right way : SELECT * FROM corpus.token WHERE name_token = 'demandé' COLLATE utf8_bin

SOLUTION using 'collate utf8_bin' with WHERE statement

  • I found that i can achieve this point by using the BINARY Keyword (see this sqlFiddle). What is the difference between collate and binary?

  • Can I preserve other tables from any changes ? (I'll have to rebuild that table anyway, because it's kind of messy)

  • I'm not very comfortable with encoding in MySQL. I don't have any problem yet with encoding in that database (and I'm kind of lucky because my data might not always use the same encoding... and there is not much I can do about it). I have a feeling that any modification regarding to that "accent sensitive" issue might create some encoding issue with other queries or data integrity. Am I right to be concerned?

Step by step :

Database creation :

CREATE DATABASE corpus DEFAULT CHARACTER SET utf8;

Table of unique words :

CREATE TABLE token (name_token VARCHAR(50), freq INTEGER, CONSTRAINT pk_token PRIMARY KEY (name_token))

Queries

SELECT * FROM corpus.token WHERE name_token = 'demande';
SELECT * FROM corpus.token WHERE name_token = 'demandé';

both returns the same row:

demande
mxdsp
  • 209
  • 2
  • 9
  • I found a way using the BINARY keyword. It seems to work in this [fiddle](http://www.sqlfiddle.com/#!9/99787/1). Is this the good way to do what I asked ? – mxdsp Dec 21 '15 at 01:51

3 Answers3

7

Collations. You have two choices, not three:

utf8_bin treats all of these as different: demandé and demande and Demandé.

utf8_..._ci (typically utf8_general_ci or utf8_unicode_ci) treats all of these as the same: demandé and demande and Demandé.

If you want only case sensitivity (demandé = demande, but neither match Demandé), you are out of luck.

If you want only accent sensitivity (demandé = Demandé, but neither match demande), you are out of luck.

Declaration. The best way to do whatever you pick:

CREATE TABLE (
    name VARCHAR(...)  CHARACTER SET utf8  COLLATE utf8_...  NOT NULL,
    ...
    PRIMARY KEY(name)
)

Don't change collation on the fly. This won't use the index (that is, will be slow) if the collation is different in name:

WHERE name = ... COLLATE ...

BINARY. The datatypes BINARY, VARBINARY and BLOB are very much like CHAR, VARCHAR, and TEXT with COLLATE ..._bin. Perhaps the only difference is that text will be checked for valid utf8 storing in a VARCHAR ... COLLATE ..._bin, but it will not be checked when storing into VARBINARY.... Comparisons (WHERE, ORDER BY, etc) will be the same; that is, simply compare the bits, don't do case folding or accent stripping, etc.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • 1
    In 8.0, see `utf8mb4_..._as_ci` – Rick James Jun 13 '19 at 15:59
  • 1
    In MySQL 8.0.2 they introduced a new collation called utf8mb4_0900_as_ci ("as" = accent sensitive, "ci" = case insensitive). To date, this is the only collation that is accent sensitive and case insensitive. – Jugdish Mar 29 '21 at 18:48
  • 1
    Note that utf8mb4_0900_as_ci does not collapse trailing whitespace in VARCHAR types. – GPHemsley Oct 30 '21 at 04:37
1

May be you need this

_ci in a collation name=case insensitive

If your searches on that field are always going to be case-sensitive, then declare the collation of the field as utf8_bin... that'll compare for equality the utf8-encoded bytes.

col_name varchar(10) collate utf8_bin

If searches are normally case-insensitive, but you want to make an exception for this search, try;

WHERE col_name = 'demandé' collate utf8_bin

More here

Songs
  • 267
  • 1
  • 2
  • 14
  • I'm going to give it a try. What would be the difference with stating BINARY keyword ? What happens if, accidently, i store data encoding with something else than utf8 ? – mxdsp Dec 21 '15 at 17:22
1

Try this

mysql> SET NAMES 'utf8' COLLATE 'utf8_general_ci';
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE t1
-> (c1 CHAR(1) CHARACTER SET UTF8 COLLATE utf8_general_ci);
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO t1 VALUES ('a'),('A'),('À'),('á');
Query OK, 4 rows affected (0.00 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> SELECT c1, HEX(c1), HEX(WEIGHT_STRING(c1)) FROM t1;
+------+---------+------------------------+
| c1   | HEX(c1) | HEX(WEIGHT_STRING(c1)) |
+------+---------+------------------------+
| a    | 61      | 0041                   |
| A    | 41      | 0041                   |
| À    | C380    | 0041                   |
| á    | C3A1    | 0041                   |
+------+---------+------------------------+
4 rows in set (0.00 sec)
sk juli kaka
  • 470
  • 1
  • 5
  • 19
  • Could you explain a little what it does ? What is the difference between utf8_general_ci and utf8_bin ? – mxdsp Dec 21 '15 at 17:24
  • http://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci – sk juli kaka Dec 22 '15 at 01:37