
So, for years and years, my PHP application has been connecting to MySQL using the default latin1 charset. Even though some of my fields are collated as utf8_general_ci, the actual data being stored in them is in some bastardized encoding. For example:

Input: ♠ »

is stored as ♠»

Now, when that data is retrieved over the same latin1 connection and displayed on a page with its encoding set to utf8, it displays just as it was entered: ♠ ». Why this is, I'm not 100% sure, but I'm guessing that whatever charset conversion is mangling the data on the way in is reversing the damage on the way out.
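That in/out symmetry can be sketched in Python, using cp1252 as a stand-in for MySQL's latin1 (MySQL's latin1 is effectively Windows-1252 for these bytes):

```python
# Going in: the client sends UTF-8 bytes, but the server decodes them as
# latin1 (really cp1252) and re-encodes the result as utf8 for storage.
original = "\u2660 \u00bb"                     # "♠ »"
stored = original.encode("utf-8").decode("cp1252")
print(stored)                                  # "â™\xa0 Â»" -- the mojibake

# Coming out: the server encodes the stored text back to latin1 bytes for
# the connection, and the browser decodes those bytes as UTF-8.
displayed = stored.encode("cp1252").decode("utf-8")
print(displayed == original)                   # True: the damage cancels out
```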

I want to fix my data. If I switch my connection charset using mysqli::set_charset('utf8'), the output is displayed as it is stored, i.e. ♠»

So, apparently I need to fix my existing data and then switch my connection charset.

How do I fix the existing bastardized data?

EDIT:

I've discovered a MySQL query that emulates the corruption process: SELECT CAST(BINARY '♠ »' AS CHAR CHARACTER SET latin1) outputs ♠»

Perhaps if I could figure out how to perform the reverse function I could use that query to fix the existing data.

EDIT 2:

I've discovered such a function: SELECT CAST(BINARY CAST('♠»' AS CHAR CHARACTER SET latin1) AS CHAR CHARACTER SET utf8) outputs ♠ »

My only concern now is what this will do to any data that already happens to be actual utf8 data, which, for some reason, I do have in my database. For example, SELECT CAST(BINARY CAST('♠ »' AS CHAR CHARACTER SET latin1) AS CHAR CHARACTER SET utf8) outputs (nothing)
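The reverse conversion, and the hazard it poses to data that is already valid utf8, can be sketched in Python (again using cp1252 as a stand-in for MySQL's latin1):

```python
# Reversing the corruption: re-encode the mojibake as latin1/cp1252 bytes,
# then decode those bytes as UTF-8.
corrupted = "â™\xa0 Â»"
print(corrupted.encode("cp1252").decode("utf-8"))   # "♠ »"

# On text that is already genuine utf8, the first step fails outright,
# because characters like "♠" have no latin1/cp1252 byte at all -- MySQL's
# CAST reports this as a warning and an empty result rather than an error.
try:
    "\u2660 \u00bb".encode("cp1252")
except UnicodeEncodeError:
    print("already utf8; the conversion would destroy it")
```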

Dan
  • here's [a great way to convert your database](https://www.a2hosting.com/kb/developer-corner/mysql/convert-mysql-database-utf-8) – derelict Sep 01 '16 at 02:41
  • The character set of the fields in question is already `utf8_general_ci`. The problem is the data stored is seemingly corrupted due to the connection having been `latin1` – Dan Sep 01 '16 at 02:45
  • maybe mysqldump and [run the whole file](http://stackoverflow.com/questions/64860/best-way-to-convert-text-files-between-character-sets) through a conversion? or at least the affected tables -- i don't think your actual SQL commands would be affected – derelict Sep 01 '16 at 02:49
  • I'm not sure what the conversion would be, however. I'm not really sure what "charset" the stored data qualifies as. – Dan Sep 01 '16 at 02:59
  • treat it as `latin1` and convert to `utf8_general_ci` – derelict Sep 01 '16 at 03:01
  • See my edit. Is there a way to do that in MySQL directly? – Dan Sep 01 '16 at 03:10

2 Answers


From http://jonisalonen.com/2012/fixing-doubly-utf-8-encoded-text-in-mysql/:

A function that auto-detects and converts possibly-corrupted latin1 text data to utf8:

DELIMITER $$

CREATE FUNCTION maybe_utf8_decode(str text charset utf8)
RETURNS text CHARSET utf8 DETERMINISTIC
BEGIN
    declare str_converted text charset utf8;
    declare max_error_count int default @@max_error_count;
    -- Temporarily allow conversion errors so a failed decode doesn't abort
    set @@max_error_count = 0;
    -- Reinterpret the text as latin1 bytes, then decode those bytes as utf8
    set str_converted = convert(binary convert(str using latin1) using utf8);
    set @@max_error_count = max_error_count;
    -- A warning means the reinterpreted bytes weren't valid utf8, i.e. the
    -- value was already proper utf8 -- return it unchanged
    if @@warning_count > 0 then
        return str;
    else
        return str_converted;
    end if;
END$$

DELIMITER ;

Usage:

update mytable set mycolumn = maybe_utf8_decode(mycolumn);
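The function's fall-back logic amounts to the following, sketched here in Python with cp1252 standing in for MySQL's latin1:

```python
def maybe_utf8_decode(s: str) -> str:
    """Undo latin1-over-utf8 mojibake, leaving genuine utf8 text untouched."""
    try:
        # Reinterpret as latin1/cp1252 bytes, then decode those bytes as UTF-8
        return s.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either step failing means the text was already fine; keep it as-is
        return s

print(maybe_utf8_decode("â™\xa0 Â»"))                          # "♠ »" repaired
print(maybe_utf8_decode("\u2660 \u00bb") == "\u2660 \u00bb")   # True, left alone
```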
Dan
  • I like the idea of a function to fix improperly encoded data and gave it a try after making a backup. However, it gave an error, *SQL Error (1366): Incorrect string value: '\xA0 Ther...' for column 'str' at row 4 *, in my MySQL client when I ran it. – DonP Dec 30 '19 at 21:48

Before attempting to "fix" the data, make sure of what you have. SELECT col, HEX(col) ... -- might be 3 bytes: E299A0, or it might be more: C3A2 E284A2 C2A0. The former is Mojibake; the latter is "double encoding". The repairs are different. More discussion here and here.
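The two hex patterns described above can be reproduced in Python, using the spade character as the test value and cp1252 as a stand-in for MySQL's latin1:

```python
spade = "\u2660"  # "♠"

# Correctly stored utf8: the column's HEX() shows the 3 UTF-8 bytes
print(spade.encode("utf-8").hex())             # e299a0

# Double encoding: the mojibake characters were themselves stored as utf8,
# so HEX() shows 7 bytes instead of 3
mojibake = spade.encode("utf-8").decode("cp1252")
print(mojibake.encode("utf-8").hex())          # c3a2e284a2c2a0
```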

Rick James