I just changed my MySQL database table from utf8 to utf8mb4 so that my CMS could use emoji, etc. I'm finding though that some of the old content now is showing "gremlins" that previously displayed as proper text. I think most of them are from Windows-1252 encoding. Is there a way to efficiently clean those up in MySQL? I thought about doing so in my script with regex whenever I output an entry (see below for regex for the things I've found so far), but it seems like maybe I should do this across the database once and for all instead?
#Fix ugly gremlins from bad encoding.
$hdata =~ s/’/'/sg; # single quote
$hdata =~ s/“…/“/sg; # left smart quote
$hdata =~ s/…/.../sg; # elipsis
$hdata =~ s/–/--/sg; # emdash
$hdata =~ s/“/“/sg; # left smart quote
$hdata =~ s/â€/”/sg; # right smart quote
$hdata =~ s/ –/”/sg; # right smart quote
$hdata =~ s/Â//sg; # Useless?
Those are the ones I've found so far, but maybe there are more? It just seems like this is a messy approach to fixing the problem.
Here's a sample of quoted text with the problematic characters (this is how it appears within the MySQL database itself -- this isn't a corruption caused within Perl):
McDonald’s fries, for approximately 4½ minutes, while they’re absolutely searing hot, are the greatest food on Earth. But their half-life is astoundingly fast, and by the time these babies are cold, they taste like mealy little icicles. The batch I sampled was warm, not piping, so their greatness was compromised. But I love a thin fry and, perhaps more, I love the memory of great McDonald’s fries I’ve had in the past.
Here is what my pattern replacement accomplishes to clean it up:
McDonald's fries, for approximately 4½ minutes, while they're absolutely searing hot, are the greatest food on Earth. But their half-life is astoundingly fast, and by the time these babies are cold, they taste like mealy little icicles. The batch I sampled was warm, not piping, so their greatness was compromised. But I love a thin fry and, perhaps more, I love the memory of great McDonald's fries I've had in the past.
Again, I may be missing other possible problematic characters and I wonder if I should be cleaning it up at the database level rather than in Perl as it is output.