0

I have to import some data containing national characters not coded as Unicode into stand alone MySQL Version: 5.0.18 running on Windows7 64bit. After some initial problems I finally got it working in MySQL console.

But as the data is more than 50 MByte typing into console and or using clipboard is not possible. So I created script file only to find out that the national characters are gibberish after import.

The problem is that if I use source command with any file the encoding broke. If I open the same file and copy the stuff to console by clipboard all works as should. Here smallest MCVE to test this:

DROP DATABASE IF EXISTS dbs;
CREATE DATABASE dbs;
USE dbs;

SET NAMES latin2;

DROP TABLE IF EXISTS `tab`;
CREATE TABLE `tab` (`ix` INT default 0,`nam` VARCHAR(50) default '' );
INSERT INTO `tab` VALUES
 (1,'aacdeillnoorrstuuyzAACDEILLNOORRSTUUYZ'),
 (2,'áäčďéíĺľňóôŕřšťú ýžÁ ČĎÉÍĹĽŇÓ ŔŘŠŤÚ ÝŽ');
SELECT * FROM `tab`;

When I copy this to MySQL console by clipboard the output is like this:

+------+----------------------------------------+
| ix   | nam                                    |
+------+----------------------------------------+
|    1 | aacdeillnoorrstuuyzAACDEILLNOORRSTUUYZ |
|    2 | áäčďéíĺľňóôŕřšťú ýžÁ ČĎÉÍĹĽŇÓ ŔŘŠŤÚ ÝŽ |
+------+----------------------------------------+
2 rows in set (0.00 sec)

Which is desired. But when I put all this into test.sql file and run

source test.sql;

I got this output:

+------+----------------------------------------+
| ix   | nam                                    |
+------+----------------------------------------+
|    1 | aacdeillnoorrstuuyzAACDEILLNOORRSTUUYZ |
|    2 | ßńŔ´ÚÝňż˛ˇ˘Ó°ÜŁ˙ ř×┴ ╚¤╔═┼╝ĎË └ěŐŹ┌ ŢÄ |
+------+----------------------------------------+

Which is obviously wrong (looks like some default MS-DOS charset). I think the problem is not on side of table nor database as this is the same for pure text outs like:

SET NAMES latin2;
SELECT 'áäčďéíĺľňóôŕřšťú ýžÁ ČĎÉÍĹĽŇÓ ŔŘŠŤÚ ÝŽ' AS 'aacdeillnoorrstuuyzAACDEILLNOORRSTUUYZ';

Which outputs with clipboard:

+----------------------------------------+
| aacdeillnoorrstuuyzAACDEILLNOORRSTUUYZ |
+----------------------------------------+
| áäčďéíĺľňóôŕřšťú ýžÁ ČĎÉÍĹĽŇÓ ŔŘŠŤÚ ÝŽ |
+----------------------------------------+

And with source file:

+----------------------------------------+
| aacdeillnoorrstuuyzAACDEILLNOORRSTUUYZ |
+----------------------------------------+
| ßńŔ´ÚÝňż˛ˇ˘Ó°ÜŁ˙ ř×┴ ╚¤╔═┼╝ĎË └ěŐŹ┌ ŢÄ |
+----------------------------------------+

It is like when importing from files the encoding got screwed. Or while inputing to MySQL console through Keyboard or Clipboard the coding got changed.

So what is going on and how to correct this (without the loss of data)?

  • using < command line option instead of source does not help
  • using -e command line option for source does not help
  • using default charset command line option does not help
  • using UTF8 for non Unicode string lead to Data too long errors and loss of data
  • data in clipboard is the same as in the file

[Edit1]

Well I tried newer version of MySQL 5.7.19 which took mi quite a while to get started as they change the initialization and things (wtf? that insanity got 1.8 GByte without any data !). It behaves the same no matter what I do. So I tried to use UTF8 encoding:

DROP DATABASE IF EXISTS dbs;
CREATE DATABASE dbs CHARACTER SET utf8 COLLATE 'utf8_unicode_ci';
USE dbs;

SET NAMES utf8;

DROP TABLE IF EXISTS `tab`;
CREATE TABLE `tab` (`ix` INT default 0,`nam` VARCHAR(50) default '' ) CHARACTER SET utf8 COLLATE 'utf8_unicode_ci';
INSERT INTO `tab` VALUES
 (1,'áäčďéíĺľňóôŕřšťú ýžÁ ČĎÉÍĹĽŇÓ ŔŘŠŤÚ ÝŽ'),
 (2,'aacdeillnoorrstuuyzAACDEILLNOORRSTUUYZ');
SELECT * FROM `tab`;

#SELECT 'áäčďéíĺľňóôŕřšťú ýžÁ ČĎÉÍĹĽŇÓ ŔŘŠŤÚ ÝŽ' AS 'aacdeillnoorrstuuyzAACDEILLNOORRSTUUYZ';

#SHOW COLLATION;
#SHOW CHARACTER SET;
SHOW VARIABLES LIKE 'char%';

And Yes the script file is converted to UTF8. Now here is my.ini setting:

[mysql]

 default-character-set=utf8

[mysqld]

 skip-character-set-client-handshake
 character-set-server=utf8
 collation-server=utf8_unicode_ci

This finally worked for the file using source here the result:

+------+--------------------------------------------------------------------------+
| ix   | nam                                                                      |
+------+--------------------------------------------------------------------------+
|    1 | áäčďéíĺľňóôŕřšťú ýžÁ ČĎÉÍĹĽŇÓ ŔŘŠŤÚ ÝŽ                                   |
|    2 | aacdeillnoorrstuuyzAACDEILLNOORRSTUUYZ                                   |
+------+--------------------------------------------------------------------------+
+--------------------------+--------------------------------------------------------------------+
| Variable_name            | Value                                                              |
+--------------------------+--------------------------------------------------------------------+
| character_set_client     | utf8                                                               |
| character_set_connection | utf8                                                               |
| character_set_database   | utf8                                                               |
| character_set_filesystem | binary                                                             |
| character_set_results    | utf8                                                               |
| character_set_server     | utf8                                                               |
| character_set_system     | utf8                                                               |
+--------------------------+--------------------------------------------------------------------+
Spektre
  • 49,595
  • 11
  • 110
  • 380
  • Do MySQL and the script agree on a character encoding? Looks like the database is interpreting the script file using a different character encoding than it is saved in. – IInspectable Oct 13 '17 at 21:37
  • @IInspectable Yeah .. but how to change it or force it to do what it should ? The only charset that pass all the entries is latin2 which is consistent with the data source but passing through file breaks it ... like something re-encode it into different charset possibly on Windows side... – Spektre Oct 13 '17 at 21:51
  • Try to save the script file using UTF-8 with a BOM. You can use Notepad to do this. – IInspectable Oct 13 '17 at 21:58

1 Answers1

1

You must specify a CHARACTER SET when creating the table, preferably on the column itself. Otherwise, you get some default from SHOW VARIABLES LIKE 'char%';

SET NAMES establishes the encoding in the client.

When INSERTing or SELECTing, the encoding is changed from the client's encoding (SET NAMES) to the column's (... VARCHAR ... CHARACTER SET ...).

Do you really need latin2? The world is moving to UTF-8.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • I added edit1 with UTF8 solution. The latin2 did not work no matter what I did. I wanted to use UTF8 in the first place but I was unable to get it to work with the Old MySQL version (even if it should work but the my.ini settings was ignored somehow). I upgraded to `5.7.19` and UTF8 stuff finally works (latin2 still not however) so I consider this as solved. As the query in your answer lead me to solution I Accept your answer. – Spektre Oct 14 '17 at 13:19
  • 1
    Yeah it is a challenge. As pointed out in "Best practice" of https://stackoverflow.com/questions/38363566/trouble-with-utf-8-characters-what-i-see-is-not-what-i-stored , there are 6 places where you need to specify the character set. Else things break in about 5 different ways. – Rick James Oct 14 '17 at 14:47