209

Our previous programmer set the wrong collation in a table (Mysql). He set it up with Latin collation, when it should be UTF8, and now I have issues. Every record with Chinese and Japan character turn to ??? character.

Is possible to change collation and get back the detail of character?

Rick James
  • 135,179
  • 13
  • 127
  • 222
Jeg Bagus
  • 4,895
  • 9
  • 43
  • 54
  • possible duplicate of [MySql alter table Collation](http://stackoverflow.com/questions/742205/mysql-alter-table-collation) – kenorb Mar 03 '15 at 14:45
  • 1
    What does collation have to do with '???' character set? I thought that was to do with the character set? – Peter Chaula Jun 24 '19 at 13:23
  • I am changing the title to reflect the intent. Changing the _default_ collation for a _database_ is much less than was desired. – Rick James Dec 18 '19 at 19:45

5 Answers5

435

change database collation:

ALTER DATABASE <database_name> CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

change table collation:

ALTER TABLE <table_name> CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

change column collation:

ALTER TABLE <table_name> MODIFY <column_name> VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

What do the parts of utf8mb4_0900_ai_ci mean?

3 bytes -- utf8
4 bytes -- utf8mb4 (new)
v4.0 --   _unicode_
v5.20 --  _unicode_520_
v9.0 --   _0900_ (new)
_bin      -- just compare the bits; don't consider case folding, accents, etc
_ci       -- explicitly case insensitive (A=a) and implicitly accent insensitive (a=á)
_ai_ci    -- explicitly case insensitive and accent insensitive
_as (etc) -- accent-sensitive (etc)
_bin         -- simple, fast
_general_ci  -- fails to compare multiletters; eg ss=ß, somewhat fast
...          -- slower
_0900_       -- (8.0) much faster because of a rewrite

More info:

Timo Huovinen
  • 53,325
  • 33
  • 152
  • 143
  • 5
    Beware `CHARACTER SET utf8` will default to `utf8_general_ci` but you can also define the collation like this `ALTER DATABASE CHARACTER SET utf8 COLLATE utf8_unicode_ci;` if needed – KCD Feb 17 '14 at 08:10
  • 1
    ...and I recommend you test it `create table testit(a varchar(1)); show create table testit \G drop table testit;` – KCD Feb 17 '14 at 08:13
  • 2
    Just want to mention that the second will change the collation to `utf8_general_ci`; if you want to change it to `utf8_unicode_ci`, you can define collation: `ALTER TABLE CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;`. This works on tables exactly the same as it works on databases, as @KCD has pointed out. – wiser Apr 28 '15 at 04:01
  • 9
    It is better to do the following for full utf8 support `ALTER DATABASE CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci`. You should do the same for the other two statements. – Greeso Nov 23 '15 at 16:02
  • 1
    Do you really need to use "ALTER TABLE MODIFY ...". According to https://dev.mysql.com/doc/refman/5.5/en/alter-table.html it seems "ALTER TABLE CONVERT TO CHARACTER SET ..." also changes the columns? Or maybe I'm not reading/understanding the manual correctly. – hansfn Feb 06 '17 at 01:33
  • Sometimes the collation is set on the table columns. As a result of that, any changes you make on the database won't have any effect on the columns in that database's tables – Peter Chaula Jun 24 '19 at 13:22
65

Here's how to change all databases/tables/columns. Run these queries and they will output all of the subsequent queries necessary to convert your entire database to character encoding utf8mb4 and collations to the MySQL 8 default of utf8mb4_0900_ai_ci. Hope this helps!

-- Change DATABASE Default Collation

SELECT
CONCAT('ALTER DATABASE `', SCHEMA_NAME,'` CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;')
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME
NOT IN ('sys','mysql','information_schema','performance_schema','innodb')
AND SCHEMA_NAME LIKE 'database_name';

Note that changing the schema default changes the default for new tables (and their columns). It does not modify existing columns of existing tables.

-- Change TABLE Collation / Char Set

SELECT
CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME, '` CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;')
FROM information_schema.TABLES
WHERE TABLE_SCHEMA
NOT IN ('sys','mysql','information_schema','performance_schema','innodb')
AND TABLE_TYPE = 'BASE TABLE'
AND TABLE_SCHEMA LIKE 'database_name';

Note that changing the table default changes the default for new columns. It does not modify existing columns on existing tables.

-- Change COLUMN Collation / Char Set

SELECT
CONCAT('ALTER TABLE `', TABLE_SCHEMA, '`.`', TABLE_NAME, '` MODIFY COLUMN `', COLUMN_NAME, '` ', COLUMN_TYPE,
              IF(COLUMN_DEFAULT IS NOT NULL, CONCAT(' DEFAULT \'', COLUMN_DEFAULT, '\''), ''),
              IF(IS_NULLABLE = 'YES', ' NULL ', ' NOT NULL '),
              ' COLLATE utf8mb4_0900_ai_ci;')
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA
NOT IN ('sys','mysql','information_schema','performance_schema','innodb')
AND COLLATION_NAME IS NOT NULL
AND TABLE_SCHEMA LIKE 'database_name'
AND COLLATION_NAME = 'old_collation_name';

This changes the actual columns and the database behaviour on queries. It does not, however, convert the data if the data is not in a compatible collation/character set. See https://dev.mysql.com/blog-archive/mysql-8-0-collations-migrating-from-older-collations/ for details on migrating from older collations. We also assume here that your default values do not include a single quote - it would need to be escaped - and we ensure that COLLATION_NAME is not NULL to exclude columns with integers, timestamps etc.

We filter out the built-in system schemas such as sys and mysql in all three cases, as these should likely not be modified unless you have an explicit reason to do so.

nickdnk
  • 4,010
  • 4
  • 24
  • 43
David Whittaker
  • 671
  • 5
  • 3
  • Good. ! It is about an hour that i' m trying to solve the same problem. I use these 3 commands, and i saw that the charset had change. But the the main problem is remaining for me. If i wrote directly to database then everything show well in my browser. But if i added some content from the website form, the result in database is just ??????. Are there anything i should consider? My web app is a .NET MVC app. – Tchaps Jul 11 '15 at 16:16
  • Saving into useful queries for future projects. – Manatax Mar 01 '16 at 23:08
  • I've suggested some edits because these automated queries weren't quite safe to use yet. There's still a problem with the CHARACTER_MAXIMUM_LENGTH: The original one can be too high when you change from e.g. latin1_swedish_ci to utf8_unicode_ci. – Ruben May 03 '16 at 15:06
  • 1
    This is an excellent answer. I have three comments/questions: 1) Why the use of "t1" in the COLUMN-code? I don't see any need for it. 2) Why "t1.data_type , '(' , t1.CHARACTER_MAXIMUM_LENGTH , ')'" and not just "t1.column_type"? 3) Why the mixture of upper case and lower case - TABLE_SCHEMA vs table_name and so on? – hansfn Feb 06 '17 at 00:41
  • Beware that this changes the columns to be NULLable and removes its defaults (at least in MySQL). – Jan Pešek Jun 01 '21 at 09:47
  • 1
    @JanPešek I just edited it to not suffer from this problem. – nickdnk Jun 29 '22 at 11:13
30

Beware that in Mysql, the utf8 character set is only a subset of the real UTF8 character set. In order to save one byte of storage, the Mysql team decided to store only three bytes of a UTF8 characters instead of the full four-bytes. That means that some east asian language and emoji aren't fully supported. To make sure you can store all UTF8 characters, use the utf8mb4 data type, and utf8mb4_bin or utf8mb4_general_ci in Mysql.

bluecollarcoder
  • 14,339
  • 4
  • 22
  • 18
  • 1
    By now, it's recommended to use `utf8mb4_unicode_ci` instead of `utf8mb4_general_ci`. See https://stackoverflow.com/questions/766809/whats-the-difference-between-utf8-general-ci-and-utf8-unicode-ci and https://drupal.stackexchange.com/questions/166405/why-are-we-using-utf8mb4-general-ci-and-not-utf8mb4-unicode-ci – Robin van Baalen Aug 23 '19 at 20:43
8

Adding to what David Whittaker posted, I have created a query that generates the complete table and columns alter statement that will convert each table. It may be a good idea to run

SET SESSION group_concat_max_len = 100000;

first to make sure your group concat doesn't go over the very small limit as seen here.

     SELECT a.table_name, concat('ALTER TABLE ', a.table_schema, '.', a.table_name, ' DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci, ',
        group_concat(distinct(concat(' MODIFY ',  column_name, ' ', column_type, ' CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci ', if (is_nullable = 'NO', ' NOT', ''), ' NULL ',
        if (COLUMN_DEFAULT is not null, CONCAT(' DEFAULT \'', COLUMN_DEFAULT, '\''), ''), if (EXTRA != '', CONCAT(' ', EXTRA), '')))), ';') as alter_statement
    FROM information_schema.columns a
    INNER JOIN INFORMATION_SCHEMA.TABLES b ON a.TABLE_CATALOG = b.TABLE_CATALOG
        AND a.TABLE_SCHEMA = b.TABLE_SCHEMA
        AND a.TABLE_NAME = b.TABLE_NAME
        AND b.table_type != 'view'
    WHERE a.table_schema = ? and (collation_name = 'latin1_swedish_ci' or collation_name = 'utf8mb4_general_ci')
    GROUP BY table_name;

A difference here between the previous answer is it was using utf8 instead of ut8mb4 and using t1.data_type with t1.CHARACTER_MAXIMUM_LENGTH didn't work for enums. Also, my query excludes views since those will have to altered separately.

I simply used a Perl script to return all these alters as an array and iterated over them, fixed the columns that were too long (generally they were varchar(256) when the data generally only had 20 characters in them so that was an easy fix).

I found some data was corrupted when altering from latin1 -> utf8mb4. It appeared to be utf8 encoded latin1 characters in columns would get goofed in the conversion. I simply held data from the columns I knew was going to be an issue in memory from before and after the alter and compared them and generated update statements to fix the data.

Community
  • 1
  • 1
Jacob Hundley
  • 81
  • 1
  • 3
4

here describes the process well. However, some of the characters that didn't fit in latin space are gone forever. UTF-8 is a SUPERSET of latin1. Not the reverse. Most will fit in single byte space, but any undefined ones will not (check a list of latin1 - not all 256 characters are defined, depending on mysql's latin1 definition)

Taryn
  • 242,637
  • 56
  • 362
  • 405
MJB
  • 9,352
  • 6
  • 34
  • 49