2

I have dumped my small MySQL table (manually reduced to localize the problem) to show it here:

SET SQL_MODE = "NO_AUTO_VALUE_ON_ZERO";
SET time_zone = "+00:00";


/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8mb4 */;



CREATE TABLE `symb` (
  `smb` varchar(200) NOT NULL,
  `trtmnt` varchar(200) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


INSERT INTO `symb` (`smb`, `trtmnt`) VALUES
('і', 'ty'),
('ї', 'hr');


/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;

If you create the MySQL table above and run this query

select * from symb where smb = 'ї';

or this one (queries are different - please see the symbols 'ї' vs 'і')

select * from symb where smb = 'і';

then you may see you get two rows selected instead of one as I would expect.

To reemphasize, these two select queries above are different - the symbol 'ї' is different from 'і' (both are cyrillic symbols, 'і' is NOT latin here).

Collation chosen was utf8_general_ci

Any reasons why 'і' and 'ї' are treated as the same symbols and what's the proper way to make it different? I need to select the exact row, not two.

Queries above were tested in phpMyAdmin and HeidiSQL which means that's MySQL (collation?) issue, not the program used to run queries. Each different symbol should be treated as a different symbol and the table should be case sensitive. What's wrong with the table above? As result I'm unable to set unique key for this row.

Thank you.

Just added based on comments: What does SHOW TABLE STATUS LIKE 'symb' show you? It shows me:

Name    symb
Engine  InnoDB
Version 10
Row_format  Compact
Rows    2
Avg_row_length  8192
Data_length 16384
Max_data_length 0
Index_length    0
Data_free   0
Auto_increment  NULL
Create_time 22.05.16 12:11
Update_time NULL
Check_time  NULL
Collation   utf8_general_ci
Checksum    NULL
Create_options  
Comment 
Haradzieniec
  • 9,086
  • 31
  • 117
  • 212

4 Answers4

3

That is the way, how the collation chosen by you works. You can look here for more information: https://stackoverflow.com/a/1036459/4099089

Community
  • 1
  • 1
michaJlS
  • 2,465
  • 1
  • 16
  • 22
  • Thank you for the link. Am I right the recommended table should be utf8_unicode_ci? Am I right I should create a COMPLETELY new table without converting the existing table? Are both questions have the recommended answer YES? Thank you. – Haradzieniec May 22 '16 at 09:46
  • 1
    Michał , I've followed your recommendations - started from scratch with utf8_unicode_ci. The problem with 'ї' vs 'і' resolved. Another similar problem appeared - the same issue with ґ vs г. Any ideas how to fix that? Thank you. – Haradzieniec May 22 '16 at 10:44
  • I am sorry, but I don't know. I would look at other collations and try to check online if there are no recommendations about best collation for your language. – michaJlS May 22 '16 at 10:56
2

Because your SELECT statement is returning both records, it appears that your data has already been encoded wrongly into UTF-8. So merely changing the encoding of the smb column from Latin1 to UTF-8 won't work. One option for you would be to dump the database to binary, and then reimport it as UTF-8:

mysqldump --add-drop-table your_database | replace CHARSET=latin1 CHARSET=utf8 |
    iconv -f latin1 -t utf8 | mysql your_database

Read here and here for more information.

Community
  • 1
  • 1
Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
1

Which do you want?

D197       1111=x0457  [ї]    L  CYRILLIC SMALL LETTER YI
C3AF        239=x00EF  [ï]    L  LATIN SMALL LETTER I WITH DIAERESIS

If you do SELECT col, HEX(col) ... you should get either D197 or C3AF for a correctly stored YI or i-umlaut. That is the best way to tell if it was stored correctly as utf8 (or utf8mb4).

They look the same, but they are treated differently. All the utf8/utf8mb4 collations sort all Cyrillic letters after all Latin letters.

The "best" "general" collation is utf8mb4_unicode_520_ci. (utf8, instead of utf8mb4, is ok if you don't need Chinese or Emoji.)

Here is my rundown of how Western European characters compare in various utf8/utf8mb4 collations. utf8_spanish2_ci, for example, is the only one to treat ll as a 'separate character', after all other l values. utf8_latvian_ci handles Ķ and Ļ as separate letters. Etc.

SHOW TABLE STATUS shows the default for the table; you need to look at SHOW CREATE TABLE to see if any column overrides that default.

Rick James
  • 135,179
  • 13
  • 127
  • 222
0

I've solved* this issue in the following way:

1) Change table collation to utf8mb4_unicode_520_ci

ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci

This allows you to insert all letters in Ukrainian alphabet except for ґ. This also allows you to sort letters the way they are supposed to.

2) Change column collation to utf8mb4_bin

ALTER TABLE table_name MODIFY column_name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;

This allows you to insert ґ character.

*The only drawback of this approach is that when sorting you have to use

SELECT * FROM table_name ORDER BY column_name COLLATE utf8mb4_unicode_520_ci ASC

But still, it won't sort DESC

Alex Lomakin
  • 433
  • 5
  • 9