0

Somehow my database tables changed all my emoji and foreign characters into Mojibake. I'm trying to reverse it by using this function:

UPDATE table SET user_post = convert(cast(convert(user_post using latin1) as binary) using utf8mb4);

It seems that this actually works most of the time. But I am also noticing that large portions of my data are being deleted and I'm errors such as:

Invalid utf8 character string: 'FC6265'

I had to restore my database table because this function is wiping out huge chunks of my user posts, instead of just individual characters. On a table with 500k posts, this might negatively affect 50k rows.

Is there a way to prevent deletion if this function runs into an invalid character that it can't properly convert? Or is there an even better function to convert the Mojibake back into proper characters and emojis?

UPDATE:

I tried a number of things related to this post: Trouble with UTF-8 characters; what I see is not what I stored

I found that these characters appear to be "double encoded" based on HEX tests

I have tried running the following query on a test product using the CONVERT method:

UPDATE table SET description = IFNULL(CONVERT(CONVERT(CONVERT(description USING latin1) USING binary) USING utf8mb4), description );

But I get an error like the following, and then half the product description gets deleted/truncated:

Warning: #1300 Invalid utf8mb4 character string: 'A02047'

After rolling back the database, I tried the ALTER method (described here: http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases). Since my column is already utf8mb4, I skipped to step 3 of that guide:

Step 3) ALTER TABLE table MODIFY description LONGTEXT CHARSET latin1;
Step 4) ALTER TABLE table MODIFY description LONGBLOB;
Step 5) ALTER TABLE table MODIFY description LONGTEXT CHARSET utf8mb4;

After Step 3, I get a bunch of errors like this (but not every row):

Warning: #1366 Incorrect string value: '\xE2\x86\x91\xE2\x86\x91...' for column 'description' at row 34882
Warning: #1366 Incorrect string value: '\xE2\x86\x91\xE2\x86\x93...' for column 'description' at row 45270
...

After Step 5, I get a bunch of errors like this, and the descriptions also get truncated just like the CONVERT method:

Warning: #1366 Incorrect string value: '\xA0our m...' for column 'description' at row 20450
Warning: #1366 Incorrect string value: '\xA0</div...' for column 'description' at row 20484

UPDATE #2:

To clear up the 'A0' being found, I used the function:

UPDATE table SET description = UNHEX(REPLACE(HEX(description), 'A0', ''));

But I get this error, followed by the result being truncated:

Warning: #1366 Incorrect string value: '\xC2 GO F...' for column 'description' at row 1

The exact text that is actually stored in the database is an HTML formatted string. I'm not sure if you'll be able to see or copy and paste the "hard space" or not after I post it here:

<p><strong><span style="font-size:22px;"><span style="font-family:Arial, Helvetica, sans-serif;">It is covered by the case.  GO FIGURE ???</span></span></strong></p>

I believe the "hard space" is right after the word "case.", as everything after that gets truncated when I run the REPLACE query.

UPDATE 3

Here is the HEX before UPDATE:

3C703E3C7374726F6E673E3C7370616E207374796C653D22666F6E742D73697A653A323270783B223E3C7370616E207374796C653D22666F6E742D66616D696C793A417269616C2C2048656C7665746963612C2073616E732D73657269663B223E4E6F746520746865206C657474657265642065646765206973206E6F742076697369626C652E20497420697320636F76657265642062792074686520636173652EC2A020474F20464947555245203F3F3F3C2F7370616E3E3C2F7370616E3E3C2F7374726F6E673E3C2F703E

HEX after UPDATE:

3C703E3C7374726F6E673E3C7370616E207374796C653D22666F6E742D73697A653A323270783B223E3C7370616E207374796C653D22666F6E742D66616D696C793A417269616C2C2048656C7665746963612C2073616E732D73657269663B223E497420697320636F76657265642062792074686520636173652E

peppy
  • 173
  • 2
  • 17
  • `↑↑` (a couple of arrows) is E28691E28691 in utf8. The one ending in 93 is `↑↓` – Rick James May 18 '23 at 00:31
  • `A0` is possibly from MS Word -- a "hard space". It is not valid utf8. – Rick James May 18 '23 at 00:32
  • `A02047` Could be that hard space, a normal space, then an upper-G. – Rick James May 18 '23 at 00:35
  • Should that `C2` be `Â`? If so, that is "latin1" trying to be interpreted as utf8. Can you show us the rest of the C2 line. – Rick James May 19 '23 at 17:11
  • `GO FIGURE ???` -- Is that deliberately 3 question marks? Or 1-3 characters that got mangled? – Rick James May 19 '23 at 17:12
  • Please provide hex for the row -- both before and after the `UPDATE`. – Rick James May 19 '23 at 17:12
  • @RickJames In update #3 above, I've added the `hex` before and after `UPDATE` – peppy May 19 '23 at 22:21
  • Those two long strings seem to be hex from different strings, not an UPDATEd string. – Rick James May 19 '23 at 22:57
  • And I do see the `C2A02047` ~= `\xC2 GO F` in the 4th line – Rick James May 19 '23 at 23:02
  • @RickJames I'm not sure what you mean. In the database, I set up a test column "description2". The "description" column contains the original string and "description2" is the truncated portion left over after the `UPDATE table SET description2 = UNHEX(REPLACE(HEX(description), 'A0', ''))`. Then I do `SELECT HEX(description),HEX(description2)` and that is what you see above - the hex before the UPDATE and the hex after the UPDATE. – peppy May 19 '23 at 23:25
  • @RickJames When I'm looking at strings directly in the database (such as through phpMyAdmin or the shell), am I supposed to be seeing the encoding, for example `é` or the actual character `é`? As an emergency for my users (until I solve this problem and underlying cause), I've been doing `UPDATE table SET description = REPLACE(description, 'é', 'é');` I'm not sure if this a sound strategy, or if I'm messing up the database even more. I suspect the original problem might have came from upgrading utf8mb3 to utf8mb4 (and their corresponding collations), though not sure how to fix completely. – peppy May 20 '23 at 02:01
  • No. utf8mb3/4 cannot cause that. Having latin1 somewhere in the dataflow can cause it. Usually, it comes from having latin1 in the _client_ but declaring utf8mbX when connecting. Sorry, but I still don't see enough details to point directly at what went wrong, or how precisely to fix it. – Rick James May 20 '23 at 04:13
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/253754/discussion-between-peppy-and-rick-james). – peppy May 21 '23 at 01:34

1 Answers1

0

Were you expecting übe? (Perhaps in üben?) If so, you have one of these encodings cp1250, cp1256, cp1257, dec8, latin1, latin2, latin5, latin7; probably latin1.

Yet the processing is expecting to see utf8mb4 or utf8mb3.

You have what might be the "opposite" of Mojibake.

Maybe this will work?

CONVERT(CONVERT(UNHEX('FC6265') USING latin1) USING utf8mb4)

But I strongly recommend first testing it with SELECT, not UPDATE.

See also Trouble with UTF-8 characters; what I see is not what I stored

No current UTF-8 character includes the byte hex FC.

A0

This might work to remove the A0 garbage:

UNHEX(REPLACE(HEX(col), 'A0', ''))

(But be ready to rollback the table.) In particular, it may mess up a line ending in any of these six characters: *:JZjz.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • I still cannot solve it. I updated the original post, added things I tried from your website along with error messages. The CONVERT and the ALTER methods are truncating a number of my product descriptions. Maybe you can make out what the errors mean, if there is a simple fix to this if I am getting closer. – peppy May 18 '23 at 00:00
  • Truncation is caused by another error. – Rick James May 18 '23 at 00:27
  • See Comments on your Question. You do seem to have multiple causes. Therefore, cleaning up the text in a single pass will probably be impossible. Truncation and question marks imply lost data. Other stuff might be fixable by hand-editing. – Rick James May 18 '23 at 00:39
  • I added Update #2 above. I tried `UNHEX(REPLACE(HEX(col), 'A0', ''))` on a simple example, but I'm getting an error followed by truncation. I pasted the exact string that is in my database so you can examine it. – peppy May 18 '23 at 01:48