2

I want to convert my database to store unicode symbols.

Currently the tables have:

latin_swedish_ci collation and latin1 character set

OR

utf8_general_ci collation and utf8 character set

I am not sure how the existing data is encoded, but I suppose it is utf-8 encoded, as I am using Django which I think encodes the data in utf-8 before sending to the database.

My question is: Can I convert the tables to utf8_unicode_ci collation and utf-8 character set using the following queries without messing up the existing data? (as sugested in this post)

ALTER DATABASE databasename CHARACTER SET utf8 COLLATE utf8_unicode_ci;

ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Considering latin1 is subset of utf-8, I think it sould work. What do you guys think?

Thank you in advance.

P.S: The version of MySQL is: 5.1

Community
  • 1
  • 1

2 Answers2

0

Latin1 is not a subset of UTF-8 - ASCII is. Latin1, however, is represented in Unicode.

CONVERT TO should work, as long as the data was stored in the correct encoding in the first place. Django may have used UTF-8 on the database connection, but the database should have re-encoded on the fly.

To check the actual encoding used - Use the mysql command-line tool to execute an SQL query that selects a row that you know contains non-ASCII characters. Then use the mysql HEX() function to check the bytes used. If you see bytes greater than > 0x7f, check that they don't correspond to valid characters in https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout

Alastair McCormack
  • 26,573
  • 8
  • 77
  • 100
0

If you have c396 sitting in a latin1 column, and you want it to mean Ö, then you are half way to "double encoding". Do not use CONVERT TO; that will really get you into "double encoding".

Instead, you need the 2-step ALTER.

ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;

If you have already messed it up further, and now the Ö is hex C383E28093, then you need to fix double encoding.

This gets you the latin1 byte in 2 steps:

CONVERT(CONVERT(UNHEX('C383E28093') USING utf8) USING latin1) --> 'Ö' (C396)
HEX(CONVERT(CONVERT(UNHEX('C396') USING utf8) USING latin1)) --> 'Ö' in latin1 (D6)

This gets you the 2-byte utf8 encoding:

CONVERT(BINARY(CONVERT(CONVERT(UNHEX('C383E28093') USING utf8) USING latin1)) USING utf8)

Do you want the column to be latin1? Or utf8?

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • I want all tables to be utf-8 encoded. I had `c396` sitting in `latin1` column which meant `Ö`. Then I used `ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;` (no errors or warnings displayed by `MySQL`). Now when I use HEX() on the row with value `Ö` it returns `c396` again. – Krasimir Atanasov Feb 15 '16 at 08:38
  • This means I didn't do "double encoding", although I had run `CONVERT TO` . Also I can store unicode symbols in this table now. It seems I did the right thing. Or no? – Krasimir Atanasov Feb 15 '16 at 08:45
  • I don't know what you did exactly, but I tried it and got hex `C383E28093` -- double-encoding. Please show me `SHOW CREATE TABLE`. – Rick James Feb 15 '16 at 18:02