So, for years and years, my PHP application has been connecting to MySQL using the default latin1
charset. Even though I have some fields collated as utf8_general_ci
, the actual data that is getting stored into them is some bastardized charset. For example:
Input: ♠ »
is stored as ♠»
Now, when that data is retrieved over the same latin1
connection and displayed on a page with encoding set as utf8
, it displays just as it was entered: ♠ »
Why this is, I'm not 100% sure, but I'm guessing it's because whatever charset function which is screwing it up going in is fixing it coming out.
I want to fix my data. If I switch my connection charset using mysqli::set_charset('utf8')
, the output is displayed as it is stored, i.e. ♠»
So, apparently I need to fix my existing data and then switch my connection charset.
How do I fix the existing bastardized data?
EDIT:
I've discovered a way to emulate the corruption process that is happening in a MySQL query:
SELECT CAST(BINARY '♠ »' AS CHAR CHARACTER SET latin1)
outputs♠»
Perhaps if I could figure out how to perform the reverse function I could use that query to fix the existing data.
EDIT 2:
I've discovered such a function:
SELECT CAST(BINARY CAST('♠»' AS CHAR CHARACTER SET latin1) AS CHAR CHARACTER SET utf8)
outputs♠ »
My only concern now is what this will do to any data that already happens to be actual utf8 data, which, for some reason, I do have in my database. For example,
SELECT CAST(BINARY CAST('♠ »' AS CHAR CHARACTER SET latin1) AS CHAR CHARACTER SET utf8)
outputs (nothing)