
I've got a big ol' MySQL 5.1 database, and for a variety of stupid reasons, I've been storing, I believe, UTF8 characters encoded as LATIN1 in a UTF8 table. It's... strange. And I'd like to fix it.

The "MySQL - Convert latin1 characters on a UTF8 table into UTF8" question seems to work -- one column at a time. But I have 24 tables and dozens of columns to convert, so I'm really looking for a solution that will handle at least a whole table at once.

For reference, the single-column solution that works for me is:

UPDATE foo SET col1 = CONVERT(CAST(CONVERT(col1 USING latin1) AS binary) USING utf8);
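
Before running that destructively, I can preview the repair with the same expression in a read-only query (a minimal sketch -- foo and col1 are stand-ins for a real table and column):

-- Read-only preview of the double conversion; nothing is written.
-- "foo" and "col1" are placeholders for a real table and column.
SELECT col1,
       CONVERT(CAST(CONVERT(col1 USING latin1) AS BINARY) USING utf8) AS fixed
FROM foo LIMIT 20;

Eyeballing the "fixed" column catches rows that would get mangled before any UPDATE is committed.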

For tables, I can do:

ALTER TABLE foo CONVERT TO CHARACTER SET latin1;
ALTER TABLE foo CONVERT TO CHARACTER SET binary;
ALTER TABLE foo CHARACTER SET utf8  COLLATE utf8_unicode_ci;

which gets me very close -- however, the CONVERT TO CHARACTER SET binary step turns all my VARCHAR columns into VARBINARY and my TEXT columns into BLOBs in one fell swoop. I can go through and change them back and all appears to be well... but then I'm back in the "let's modify all the columns individually" world -- in which case, I may just as well do them one column at a time to begin with.
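
(For reference, changing one of those mangled columns back looks something like this -- the column name and VARCHAR(255) definition here are hypothetical; each column needs its original definition:)

-- Turn a VARBINARY column back into a character column.
-- "col1 VARCHAR(255)" is a placeholder; use the column's original definition.
ALTER TABLE foo MODIFY col1 VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci;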

I've tried about 50 variations on those SQL statements, but I can't find one that both leaves my columns in character data types and encodes the data properly.

Any suggestions?

Update: Deciding to just fix the columns rather than waiting for a database- or table-wide solution, I came up with:

#!/usr/bin/env ruby
require 'rubygems'
require 'mysql2'

CONNECT_OPTS = {} # whatever you want
# Return rows as positional arrays rather than hashes
Mysql2::Client.default_query_options.merge!(:as => :array)
conn = Mysql2::Client.new(CONNECT_OPTS)

tables = conn.query("SHOW TABLES").map {|row| row[0] }

# See http://dev.mysql.com/doc/refman/5.0/en/charset-column.html
# One might want to include enum and set columns; I don't have them
TYPES_TO_CONVERT = %w(char varchar text)
tables.each do |table|
  puts "converting #{table}"
  # Get all the columns and we'll filter for the ones we want
  columns = conn.query("DESCRIBE #{table}")
  columns_to_convert = columns.find_all {|row|
    TYPES_TO_CONVERT.include? row[1].gsub(/\(\d+\)/, '')
  }.map {|row| row[0]}
  next if columns_to_convert.empty?

  query = "UPDATE `#{table}` SET "
  query += columns_to_convert.map {|col|
    "`#{col}` = convert(cast(convert(`#{col}` using latin1) as binary) using utf8)"
  }.join ", "
  puts query
  conn.query query
end

... which gets the job done. Amusingly, this runs on my database in 36 seconds, rather than the ALTER TABLE route, which took 13 minutes (and had the VARBINARY problem), or the mysqldump solutions, which would take upwards of twenty minutes assuming I could get them to run.

I'll still accept an answer if someone knows an elegant way to do this for a whole database or table in one step.

  • Make a mysqldump, say per table, convert it (native2ascii -reverse), possibly replace a bit of SQL relating to the encodings, and reload the table. – Joop Eggen Jun 11 '13 at 17:46
  • I've had no luck whatsoever trying to convert mysqldump files with iconv or native2ascii -- mysqldump does its own character conversion, I think. I'm pretty sure I need to convert the contents of the columns. Also: it's a 1.5GB database dump and dumping and loading it takes like 45 minutes. – Nate Jun 12 '13 at 02:07
  • I ported a same-sized database from Latin 1 to UTF-8 two years ago, both with individual table dumps and by splitting an entire dump. I also did some text replacements to change the charset. I took this approach to have a backup too - but yes, it took time. – Joop Eggen Jun 12 '13 at 07:30

1 Answer


The method below looks really promising &, better yet, is beautiful in its simplicity. The idea is that you mysqldump your entire database as latin1, & then import it re-encoded as utf-8.

Export:

mysqldump -u [user] -p --opt --quote-names --skip-set-charset --default-character-set=latin1 [database] > dump.sql

Import:

mysql -u [user] -p --default-character-set=utf8 [database] < dump.sql
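
To see what charset each column actually ended up with after the reload, information_schema has a quick diagnostic (not part of the original recipe; replace [database] with your schema name):

-- Diagnostic only: list the charset of every character column in the schema.
SELECT table_name, column_name, character_set_name
FROM information_schema.columns
WHERE table_schema = '[database]'
  AND character_set_name IS NOT NULL;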

I take no credit for this solution, it's completely from Gareth Price's blog. It has worked for everyone who has left him a comment so far: "Wow man you just saved my life. I did not spent 2 hours on this, but 2 days" caught my attention.

Update #1: Looks like Gareth wasn't the first to discover this.

Update #2: I just tried this & it worked beautifully for my UTF8-stored-as-latin1 database. Just make sure you switch the default charset on your database to utf8 before importing, or else you'll end up with plain question marks where the special characters were. Of course this might have plenty of other ramifications so test like hell first.

ALTER SCHEMA [database] DEFAULT CHARACTER SET utf8;

And if you have any tables that aren't set to the schema default:

ALTER TABLE [table] CHARACTER SET = DEFAULT;

(same idea if you have any column-specific charset settings: you'll have to do an ALTER TABLE [table] CHANGE COLUMN [settings] without specifying CHARACTER SET, so the column goes back to the table default)
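
For example (hypothetical table and column; note that CHANGE COLUMN repeats the column name and needs the full remaining definition):

-- Hypothetical column: re-state the definition minus the CHARACTER SET clause,
-- so the column falls back to the table default charset.
ALTER TABLE mytable CHANGE COLUMN mycol mycol VARCHAR(100) NOT NULL;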

  • The mysqldump trick is great if your database is fairly small; however, it takes a long long time if your database is big. – Nate Apr 08 '14 at 15:43
  • We ran it on our apparently-not-that-big database (tens of millions of records, several GB dumped) & it did okay... took some time, but minutes, not hours. – Wick Apr 10 '14 at 14:11