
So, for years and years, my PHP application has been connecting to MySQL using the default latin1 charset. Even though some of my fields are collated as utf8_general_ci, the actual data being stored in them is in some bastardized encoding. For example:

Input: ♠ »

is stored as ♠»

Now, when that data is retrieved over the same latin1 connection and displayed on a page with its encoding set to utf8, it displays just as it was entered: ♠ ». Why this is, I'm not 100% sure, but I'm guessing that whatever charset conversion is mangling the data on the way in is reversing the damage on the way out.
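That in/out symmetry can be sketched in Python, using cp1252 as a stand-in for MySQL's latin1 (MySQL's latin1 is effectively Windows-1252 for these bytes):

```python
# Going in: the client sends UTF-8 bytes, but the server decodes them as
# latin1 (really cp1252) and re-encodes the result as utf8 for storage.
original = "\u2660 \u00bb"                     # "♠ »"
stored = original.encode("utf-8").decode("cp1252")
print(stored)                                  # "â™\xa0 Â»" -- the mojibake

# Coming out: the server encodes the stored text back to latin1 bytes for
# the connection, and the browser decodes those bytes as UTF-8.
displayed = stored.encode("cp1252").decode("utf-8")
print(displayed == original)                   # True: the damage cancels out
```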

I want to fix my data. If I switch my connection charset using mysqli::set_charset('utf8'), the output is displayed as it is stored, i.e. ♠»

So, apparently I need to fix my existing data and then switch my connection charset.

How do I fix the existing bastardized data?

EDIT:

I've discovered a MySQL query that emulates the corruption process: SELECT CAST(BINARY '♠ »' AS CHAR CHARACTER SET latin1) outputs ♠»

Perhaps if I could figure out how to perform the reverse function I could use that query to fix the existing data.

EDIT 2:

I've discovered such a function: SELECT CAST(BINARY CAST('♠»' AS CHAR CHARACTER SET latin1) AS CHAR CHARACTER SET utf8) outputs ♠ »

My only concern now is what this will do to any data that already happens to be actual utf8 data, which, for some reason, I do have in my database. For example, SELECT CAST(BINARY CAST('♠ »' AS CHAR CHARACTER SET latin1) AS CHAR CHARACTER SET utf8) outputs (nothing)
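The reverse conversion, and the hazard it poses to data that is already valid utf8, can be sketched in Python (again using cp1252 as a stand-in for MySQL's latin1):

```python
# Reversing the corruption: re-encode the mojibake as latin1/cp1252 bytes,
# then decode those bytes as UTF-8.
corrupted = "â™\xa0 Â»"
print(corrupted.encode("cp1252").decode("utf-8"))   # "♠ »"

# On text that is already genuine utf8, the first step fails outright,
# because characters like "♠" have no latin1/cp1252 byte at all -- MySQL's
# CAST reports this as a warning and an empty result rather than an error.
try:
    "\u2660 \u00bb".encode("cp1252")
except UnicodeEncodeError:
    print("already utf8; the conversion would destroy it")
```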

Dan
  • here's [a great way to convert your database](https://www.a2hosting.com/kb/developer-corner/mysql/convert-mysql-database-utf-8) – derelict Sep 01 '16 at 02:41
  • The character set of the fields in question is already `utf8_general_ci`. The problem is the data stored is seemingly corrupted due to the connection having been `latin1` – Dan Sep 01 '16 at 02:45
  • maybe mysqldump and [run the whole file](http://stackoverflow.com/questions/64860/best-way-to-convert-text-files-between-character-sets) through a conversion? or at least the affected tables -- i don't think your actual SQL commands would be affected – derelict Sep 01 '16 at 02:49
  • I'm not sure what the conversion would be, however. I'm not really sure what "charset" the stored data qualifies as. – Dan Sep 01 '16 at 02:59
  • treat it as `latin1` and convert to `utf8_general_ci` – derelict Sep 01 '16 at 03:01
  • See my edit. Is there a way to do that in MySQL directly? – Dan Sep 01 '16 at 03:10

2 Answers


From http://jonisalonen.com/2012/fixing-doubly-utf-8-encoded-text-in-mysql/:

A function that auto-detects and converts possibly-corrupted latin1 text data to utf8:

DELIMITER $$

CREATE FUNCTION maybe_utf8_decode(str text charset utf8)
RETURNS text CHARSET utf8 DETERMINISTIC
BEGIN
    declare str_converted text charset utf8;
    declare max_error_count int default @@max_error_count;
    -- Temporarily allow conversion errors so a failed decode doesn't abort
    set @@max_error_count = 0;
    -- Reinterpret the text as latin1 bytes, then decode those bytes as utf8
    set str_converted = convert(binary convert(str using latin1) using utf8);
    set @@max_error_count = max_error_count;
    -- A warning means the reinterpreted bytes weren't valid utf8, i.e. the
    -- value was already proper utf8 -- return it unchanged
    if @@warning_count > 0 then
        return str;
    else
        return str_converted;
    end if;
END$$

DELIMITER ;

Usage:

update mytable set mycolumn = maybe_utf8_decode(mycolumn);
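The function's fall-back logic amounts to the following, sketched here in Python with cp1252 standing in for MySQL's latin1:

```python
def maybe_utf8_decode(s: str) -> str:
    """Undo latin1-over-utf8 mojibake, leaving genuine utf8 text untouched."""
    try:
        # Reinterpret as latin1/cp1252 bytes, then decode those bytes as UTF-8
        return s.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either step failing means the text was already fine; keep it as-is
        return s

print(maybe_utf8_decode("â™\xa0 Â»"))                          # "♠ »" repaired
print(maybe_utf8_decode("\u2660 \u00bb") == "\u2660 \u00bb")   # True, left alone
```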
Dan
  • I like the idea of a function to fix improperly encoded data and gave it a try after making a backup. However, it gave an error, *SQL Error (1366): Incorrect string value: '\xA0 Ther...' for column 'str' at row 4 *, in my MySQL client when I ran it. – DonP Dec 30 '19 at 21:48

Before attempting to "fix" the data, make sure of what you have. SELECT col, HEX(col) ... -- might be 3 bytes: E299A0, or it might be more: C3A2 E284A2 C2A0. The former is Mojibake; the latter is "double encoding". The repairs are different. More discussion here and here.
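The two hex patterns described above can be reproduced in Python, using the spade character as the test value and cp1252 as a stand-in for MySQL's latin1:

```python
spade = "\u2660"  # "♠"

# Correctly stored utf8: the column's HEX() shows the 3 UTF-8 bytes
print(spade.encode("utf-8").hex())             # e299a0

# Double encoding: the mojibake characters were themselves stored as utf8,
# so HEX() shows 7 bytes instead of 3
mojibake = spade.encode("utf-8").decode("cp1252")
print(mojibake.encode("utf-8").hex())          # c3a2e284a2c2a0
```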

Rick James