1

I just changed my MySQL database table from utf8 to utf8mb4 so that my CMS could use emoji, etc. I'm finding though that some of the old content now is showing "gremlins" that previously displayed as proper text. I think most of them are from Windows-1252 encoding. Is there a way to efficiently clean those up in MySQL? I thought about doing so in my script with regex whenever I output an entry (see below for regex for the things I've found so far), but it seems like maybe I should do this across the database once and for all instead?

#Fix ugly gremlins from bad encoding.
$hdata =~ s/’/'/sg;       # single quote
$hdata =~ s/“…/“/sg;    # left smart quote
$hdata =~ s/…/.../sg;     # elipsis
$hdata =~ s/–/--/sg;      # emdash
$hdata =~ s/“/“/sg;       # left smart quote
$hdata =~ s/â€/”/sg;        # right smart quote
$hdata =~ s/ –/”/sg;      # right smart quote
$hdata =~ s/Â//sg;          # Useless?

Those are the ones I've found so far, but maybe there are more? It just seems like this is a messy approach to fixing the problem.

Here's a sample of quoted text with the problematic characters (this is how it appears within the MySQL database itself -- this isn't a corruption caused within Perl):

McDonald’s fries, for approximately 4½ minutes, while they’re absolutely searing hot, are the greatest food on Earth. But their half-life is astoundingly fast, and by the time these babies are cold, they taste like mealy little icicles. The batch I sampled was warm, not piping, so their greatness was compromised. But I love a thin fry and, perhaps more, I love the memory of great McDonald’s fries I’ve had in the past.

Here is what my pattern replacement accomplishes to clean it up:

McDonald's fries, for approximately 4½ minutes, while they're absolutely searing hot, are the greatest food on Earth. But their half-life is astoundingly fast, and by the time these babies are cold, they taste like mealy little icicles. The batch I sampled was warm, not piping, so their greatness was compromised. But I love a thin fry and, perhaps more, I love the memory of great McDonald's fries I've had in the past.

Again, I may be missing other possible problematic characters and I wonder if I should be cleaning it up at the database level rather than in Perl as it is output.

Timothy R. Butler
  • 1,097
  • 7
  • 20
  • All of those characters are in the 0x80..0x9F block of [cp1252](https://en.wikipedia.org/wiki/Windows-1252), a range people frequently mishandle. There are other characters in that range. – ikegami Mar 12 '20 at 08:12
  • I don't see how we can answer your question, since you didn't specify what you want in `$hdata`. (Decoded text aka Unicode Code Points? UTF-8?) Having the output of `sprintf('%vX', $hdata)` for a bad string both before and after you fix it should help – ikegami Mar 12 '20 at 08:12
  • As it is, the code you posted doesn't make sense. As far as Perl is concerned, none of the characters in your literals (`â`, `€`, `“`, etc) can occur in source code unless `use utf8;` is present, but you didn't show `use utf8;` being used. So I have no idea if you're actually replacing those characters or something else. – ikegami Mar 12 '20 at 08:12
  • @ikegami, yes, I have use utf8 turned on way back at the beginning of the script. The bad characters are actually in MySQL itself (if I look at the database using phpMySQL, they are present). I'll edit the original question with a sample, but yes, it does seem to be cp1252. Problematic text is text that was originally pasted from a different source, but the issue only arose when I switched to utf8mb4 from utf8 in the MySQL table. – Timothy R. Butler Mar 12 '20 at 15:03
  • Do you decode what you get from the DB (e.g. using `mysql_enable_utf8mb4 => 1`)? – ikegami Mar 12 '20 at 15:13
  • Sorry, what you say you see is not good enough since it really it's easy to get wrong. That's why I asked for `sprintf '%vX'`. But, since you're using `use utf8;` that's not necessary. What I wonder is is how do the string appear when using phpmyadmin. Do they appear correctly using phpmyadmin or other tools? (In other words, is the problem bad data in the DB, or some kind of decoding issue) – ikegami Mar 12 '20 at 15:14
  • Also check the session charset for anything you use to look at these contents, since MySQL will automatically translate text contents to the connection charset when returning it. mysql_enable_utf8mb4 or using the better DBD::MariaDB driver will automatically set this charset to utf8mb4. It is possible the contents were always bad, but the connection charset was not what you thought you were getting, and that is now corrected. Or it is possible the opposite is true, and the contents are actually still fine. – Grinnz Mar 12 '20 at 17:10
  • If the contents are still fine, the solution is to correctly align the connection charset and the code that receives that encoding (the options I mentioned earlier should handle this fine, so it is more likely the actual data is corrupted). If this is corruption in the stored data, it was probably caused by passing UTF-8 data with a connection charset of cp1252 (regardless of the column charset, since mysql will translate it from the connection charset). In this case the way to fix it depends whether the charset of the column is latin1 (mysql's name for cp1252) or utf8mb4. – Grinnz Mar 12 '20 at 17:17
  • If the data is corrupted and the column charset is utf8/mb4, you can fix it with the following steps: 1) change the column charset to latin1. this will internally decode it from UTF-8 and encode it to cp1252, 2) convert the column to a binary column, like blob, which will store the underlying bytes. make sure it's large enough, 3) convert the column back to a text column with charset utf8mb4, this will treat the bytes as UTF-8 once again. – Grinnz Mar 12 '20 at 17:24
  • Needless to say, it's a good idea to back up your data before trying this. – Grinnz Mar 12 '20 at 17:24
  • Thank you @ikegami and grinnz. Yes, the utf8mb4 switch is turned on for DBD. Emoji and other UTF8 characters show up just fine in Perl output now. The corruption shows up in phpMySQL, not just in Perl. The table was in latin1 years ago, then moved to UTF8 then UTF8mb4, so Grinnz's suggestion sounds intriguing. The characters looked just fine until the conversion to UTF8mb4. I'll try backing things up (I do have a backup of the pre-conversion data) and running that. – Timothy R. Butler Mar 13 '20 at 19:13
  • See Mojibake in https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored – Rick James Mar 18 '20 at 03:39

0 Answers0