Somehow my database tables changed all my emoji and foreign characters into Mojibake. I'm trying to reverse it by using this function:
UPDATE table SET user_post = convert(cast(convert(user_post using latin1) as binary) using utf8mb4);
It seems that this actually works most of the time. But I am also noticing that large portions of my data are being deleted, and I'm getting errors such as:
Invalid utf8 character string: 'FC6265'
I had to restore my database table because this function is wiping out huge chunks of my user posts, instead of just individual characters. On a table with 500k posts, this might negatively affect 50k rows.
Is there a way to prevent deletion if this function runs into an invalid character that it can't properly convert? Or is there an even better function to convert the Mojibake back into proper characters and emojis?
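To make the behaviour I'm after concrete, here is a rough Python sketch (run outside MySQL) of a "repair or leave alone" conversion. The function name `undo_double_encoding` is made up for illustration, and I'm assuming Python's cp1252 codec is a close enough stand-in for MySQL's latin1; it is not identical, because MySQL's latin1 also maps the five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) that cp1252 leaves undefined, so rows containing those would simply be left unchanged here.

```python
def undo_double_encoding(text: str) -> str:
    """Return the repaired text, or the input unchanged if the round trip fails.

    Mirrors the idea behind CONVERT(CAST(CONVERT(col USING latin1) AS BINARY)
    USING utf8mb4), but never truncates: a row that cannot be converted
    cleanly is returned exactly as it came in.
    """
    try:
        # Re-encode as cp1252 to recover the original bytes, then read them as UTF-8.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either a character has no cp1252 byte, or the bytes are not valid UTF-8.
        # In both cases the row was probably not double encoded, so keep it.
        return text


# Hypothetical sample rows, not taken from my real table.
samples = [
    "\u00e2\u20ac\u2122",    # a double-encoded right single quote
    "case.\u00a0 GO",        # a correctly stored non-breaking space
    "\u2191\u2191",          # correctly stored arrows, no cp1252 equivalent
]
for row in samples:
    print(repr(row), "->", repr(undo_double_encoding(row)))
```

One caveat with this sketch: pure ASCII rows always pass through unchanged, and a row that mixes double-encoded and correctly encoded characters is left alone as a whole rather than partially repaired.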
UPDATE:
I tried a number of things related to this post: Trouble with UTF-8 characters; what I see is not what I stored
I found that these characters appear to be "double encoded" based on HEX tests.
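For reference, this is the kind of byte pattern I mean by "double encoded". The snippet below is a Python sketch rather than my actual HEX query, and it uses cp1252 as an approximation of MySQL's latin1. It takes one example character (a right single quotation mark, chosen only for illustration) and shows its bytes when stored once and when stored after being misread as latin1 and encoded again.

```python
original = "\u2019"  # right single quotation mark, used only as an example

# Correct storage: the character encoded once as UTF-8.
once = original.encode("utf-8")

# Double encoding: the UTF-8 bytes are misread as cp1252 text,
# and that text is then encoded as UTF-8 a second time.
twice = once.decode("cp1252").encode("utf-8")

print(once.hex().upper())   # E28099
print(twice.hex().upper())  # C3A2E282ACE284A2
```

Three bytes become eight, and the leading `C3A2` / `E282AC` style sequences are what I was looking for in the HEX output.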
I have tried running the following query on a test product using the CONVERT method:
UPDATE table SET description = IFNULL(CONVERT(CONVERT(CONVERT(description USING latin1) USING binary) USING utf8mb4), description );
But I get an error like the following, and then half the product description gets deleted/truncated:
Warning: #1300 Invalid utf8mb4 character string: 'A02047'
After rolling back the database, I tried the ALTER method (described here: http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases). Since my column is already utf8mb4, I skipped to step 3 of that guide:
Step 3) ALTER TABLE table MODIFY description LONGTEXT CHARSET latin1;
Step 4) ALTER TABLE table MODIFY description LONGBLOB;
Step 5) ALTER TABLE table MODIFY description LONGTEXT CHARSET utf8mb4;
After Step 3, I get a bunch of errors like this (though not for every row):
Warning: #1366 Incorrect string value: '\xE2\x86\x91\xE2\x86\x91...' for column 'description' at row 34882
Warning: #1366 Incorrect string value: '\xE2\x86\x91\xE2\x86\x93...' for column 'description' at row 45270
...
After Step 5, I get a bunch of errors like this, and the descriptions also get truncated just like the CONVERT method:
Warning: #1366 Incorrect string value: '\xA0our m...' for column 'description' at row 20450
Warning: #1366 Incorrect string value: '\xA0</div...' for column 'description' at row 20484
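My current guess is that the two sets of warnings come from rows that were never double encoded in the first place. Here is a small Python sketch of how I understand the ALTER sequence; it is only a model of the steps, not what MySQL does internally. The helper `classify_row` is hypothetical and cp1252 again stands in for MySQL's latin1.

```python
def classify_row(text: str) -> str:
    """Report which ALTER step would complain about this value."""
    try:
        # Step 3: the utf8mb4 text is transcoded to latin1.
        latin1_bytes = text.encode("cp1252")
    except UnicodeEncodeError:
        return "step 3 warning"  # character has no latin1 representation
    try:
        # Steps 4 and 5: the raw bytes are kept, then reinterpreted as utf8mb4.
        latin1_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return "step 5 warning"  # bytes are not valid UTF-8
    return "ok"


print(classify_row("\u2191\u2191"))        # arrows, as in the \xE2\x86\x91 warnings
print(classify_row("\u00a0our m"))         # non-breaking space, as in the \xA0 warnings
print(classify_row("\u00e2\u20ac\u2122"))  # a genuinely double-encoded quote
```

If this model is right, the arrows are already correct characters that latin1 cannot hold, and each non-breaking space turns into a lone A0 byte that is not valid UTF-8 on its own.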
UPDATE #2:
To get rid of the 'A0' bytes that the warnings point at, I ran this query:
UPDATE table SET description = UNHEX(REPLACE(HEX(description), 'A0', ''));
But I get this error, followed by the result being truncated:
Warning: #1366 Incorrect string value: '\xC2 GO F...' for column 'description' at row 1
The exact text stored in the database is an HTML-formatted string. I'm not sure whether the "hard space" will survive being copied and pasted here:
<p><strong><span style="font-size:22px;"><span style="font-family:Arial, Helvetica, sans-serif;">It is covered by the case. GO FIGURE ???</span></span></strong></p>
I believe the "hard space" is right after the word "case.", as everything after that gets truncated when I run the REPLACE query.
UPDATE #3:
Here is the HEX before UPDATE:
3C703E3C7374726F6E673E3C7370616E207374796C653D22666F6E742D73697A653A323270783B223E3C7370616E207374796C653D22666F6E742D66616D696C793A417269616C2C2048656C7665746963612C2073616E732D73657269663B223E4E6F746520746865206C657474657265642065646765206973206E6F742076697369626C652E20497420697320636F76657265642062792074686520636173652EC2A020474F20464947555245203F3F3F3C2F7370616E3E3C2F7370616E3E3C2F7374726F6E673E3C2F703E
HEX after UPDATE:
3C703E3C7374726F6E673E3C7370616E207374796C653D22666F6E742D73697A653A323270783B223E3C7370616E207374796C653D22666F6E742D66616D696C793A417269616C2C2048656C7665746963612C2073616E732D73657269663B223E497420697320636F76657265642062792074686520636173652E
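Looking at the two dumps, the "before" value contains C2A0 right after "case." and the "after" value stops exactly there. To check what the REPLACE on the hex string does to those bytes, I tried this Python sketch on a short fragment copied from the "before" dump. It is an approximation of the MySQL expression, not the query itself.

```python
# Fragment copied from the "before" HEX dump: "case." + C2A0 + " GO"
before_hex = "636173652EC2A020474F"

# The stored bytes decode cleanly: C2A0 is the UTF-8 encoding of U+00A0.
text = bytes.fromhex(before_hex).decode("utf-8")

# Python equivalent of UNHEX(REPLACE(HEX(description), 'A0', ''))
stripped = bytes.fromhex(before_hex.replace("A0", ""))

try:
    stripped.decode("utf-8")
    failure_offset = None
except UnicodeDecodeError as exc:
    # Only A0 was removed, so C2 is left behind without its continuation byte.
    failure_offset = exc.start

# Replacing the character itself keeps the rest of the string intact.
cleaned = text.replace("\u00a0", " ")

# A second hazard: a replace on the hex string can match across byte boundaries.
straddle = "4A05".replace("A0", "")  # the bytes 4A 05 collapse into 45

print(stripped, failure_offset)
print(repr(cleaned))
print(straddle)
```

So in this row the hard space looks like a correctly stored U+00A0 rather than a double-encoded character, and stripping only the A0 half leaves an orphaned C2, which matches the '\xC2 GO F...' warning and the truncation point.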