You have to remember that computers handle all forms of data as nothing more than sequences of 1s and 0s. In order to turn those 1s and 0s into something meaningful, the computer must somehow be told how those bits should be interpreted.
When it comes to a textual string, such information regarding its bits' interpretation is known as its character encoding. For example, the bit sequence 111000101000000010010011
, which for brevity I will express in hexadecimal notation as 0xe28093
, is interpreted under the UTF-8 character encoding to be your boss's much-loved U+2013 (EN-DASH); however that same sequence of bits could mean absolutely anything under a different encoding: indeed, under the ISO-8859-1 encoding (for example), it represents a sequence of three characters: U+00E2 (LATIN SMALL LETTER A WITH CIRCUMFLEX), U+0080 (<control>) and U+0093 (SET TRANSMIT STATE).
Unfortunately, in their infinite wisdom, PHP's developers decided not to keep track of the encoding under which your string variables are stored—that is left to you, the application developer. Worse still, many PHP functions make arbitrary assumptions about the encoding of your variables, and they happily go ahead manipulating your bits without any thought of the consequences.
So, when you call utf8_decode
on a string: it takes whatever bits you provide, works out what characters they happen to represent in UTF-8, and then returns to you those same characters encoded in ISO-8859-1. It's entirely possible to come up with an input sequence that, when passed to this function, produces absolutely any given result; indeed, if you provide as input 0xc3a2c280c293
(which happens to be the UTF-8 encoding of the three characters mentioned above), it will produce a result of 0xe28093
—the UTF-8 encoding of an "en dash"!
Such double encoding (i.e. UTF-8 encoded, treated as ISO-8859-1 and transcoded to UTF-8) appears to be what you're retrieving from MySQL when you do not call mysql_set_charset
(in such circumstances, MySQL transcodes results to whatever character set the client specifies upon connection—the standard drivers use latin1
unless you override their default configuration). In order for a result that MySQL transcodes to latin1
to produce such double encoded UTF-8, the value that is actually stored in your column must have been triple encoded (i.e. UTF-8 encoded, treated as ISO-8859-1, transcoded to UTF-8, then treated as latin1
again)!
You need to fix the data that is stored in your database:
Identify exactly how the incumbent data has actually been encoded. Some values may well be triple-encoded as described above, but others (perhaps that predate particular changes to your application code; or that were inserted/updated from a different source) may be encoded in some other way. I find SELECT HEX(myColumn) FROM myTable WHERE ...
to be very useful for this purpose.
Correct the encodings of those values that are currently incorrect: e.g. UPDATE myTable SET myColumn = BINARY CONVERT(myColumn USING latin1) WHERE ...
—if an entire column is misencoded, you can instead use ALTER TABLE
to change it to a binary string type and then back to a character string of the correct encoding. Beware of transformations that increase the encoded length, as the result might overflow your existing column size.