2

I missed this statement and I ended up with funny characters in the database:

mysql_enable_utf8 flag documentation

Additionally, turning on this flag tells MySQL that incoming data should be treated as UTF-8. This will only take effect if used as part of the call to connect(). If you turn the flag on after connecting, you will need to issue the command SET NAMES utf8 to get the same effect.

This is a part of the code:

use strict;
use utf8;
use open qw/:std :utf8/;

use DBI;

my $dbh = DBI->connect("...", $user, $pass) or die_report($@);
my $query;

$dbh->{'mysql_enable_utf8'} = 1; #this caused the problem, because it was added after the connect() statement.

$dbh->prepare("CREATE TABLE `$database`.`$table` (`id` INT(8) UNSIGNED NOT NULL AUTO_INCREMENT `response` MEDIUMTEXT) ENGINE = MYISAM DEFAULT CHARSET=utf8 COLLATE utf8_general_ci")->execute or die($@);

my $ua = LWP::UserAgent->new;
$response = $ua->get('http://example.com')->decoded_content;


$query = $dbh->prepare("INSERT INTO `$mysql_database`.`$mysql_table` (`id`, `response`) VALUES (?, ?)");
$query->execute($id, $response);

I need to understand what happens to $response when it is inserted in mysql and how to revert the damage. Does it get double encoded? Can I fix it in a smart way?

Nick
  • 331
  • 3
  • 14
  • 1
    Tip: `"\`$database\`.\`$table\`"` should be `$dbh->quote_identifier($database, $table)`. – ikegami Dec 01 '16 at 18:32
  • 1
    Please provide `HEX(\`response\`)` of corrupted data. – ikegami Dec 01 '16 at 18:46
  • @ikegami, I think the original table is "poisoned", and that's why I'm getting this. Maybe some package that I wrote did not have `use utf8` or something like that, because this is what I have: _Before table ALTER:_ **Entry 1:** **Szabóné** `537A6162C383C2B30000006EC383C2A9000000` **Entry 2:** **Szabóné** `537A6162C3B36EC3A9` _After table ALTER:_ **Entry 1:** **Szabóné** `537A6162C3B30000006EC3A9000000` **Entry 2:** **Szab** `537A6162` – Nick Dec 01 '16 at 20:53
  • LE: I meant to say: Maybe some package that I wrote did not have `use utf8` and I've added it later, or something like that... – Nick Dec 01 '16 at 20:59
  • 1
    Where do ask those NUL come from? This is corrupted by far more than a bad `mysql_enable_utf8` flag – ikegami Dec 01 '16 at 23:00
  • 1
    Are you saying you have one row that contains `537A6162C383C2B30000006EC383C2A9000000` and one that contains `537A6162C3B36EC3A9` and that both should be `Szabóné`? If so, not all rows are corrupt. `537A6162C3B36EC3A9` is what you should have. – ikegami Dec 02 '16 at 15:33
  • Yes they should both have been `Szabóné`, not all rows and corrupt [ _because the code ran in stages, and at some point I might have "fixed" it by accident with an `use utf8;` or something that changes the encoding, without realizing at that point_ ], and `537A6162C3B36EC3A9` is what I should have had. – Nick Dec 02 '16 at 20:31

2 Answers2

1

If Encode::is_utf8($response) was true, and if the connection's encoding (SHOW VARIABLES LIKE "character_set_client") was set to latin1, the following will do the trick:

UPDATE TheTable
   SET response = CONVERT(CONVERT(CONVERT(response USING latin1) USING BINARY) USING utf8)

You can view MySQL strings as bytes with an associated encoding, where BINARY is a special encoding. The following explains how conversions (whether by CONVERT or from storing into a field) work:

  1. Converting from a non-binary encoding to another non-binary encoding converts the underlying bytes.

  2. Converting from a non-binary encoding to binary simply changes the encoding associated with the bytes to binary (without changing the bytes).

  3. Converting from binary to a non-binary encoding simply changes the encoding associated with the bytes to new encoding (without changing the bytes).

So, if response contains the three characters ♡, the following happens:

C3 A2 E2 84 A2 C2 A1 [utf8] (♡)      `response` has an utf8 collation
     |
     |  CONVERT(_ USING latin1)        Changes how the string is encoded (as per #1)
     v
E2 99 A1 [latin1] (♡)
     |
     |  CONVERT(_ USING BINARY)        Changes the associated encoding (as per #2)
     v
E2 99 A1 [BINARY]
     |
     |  CONVERT(_ USING utf8)          Changes the associated encoding (as per #3)
     v
E2 99 A1 [utf8] (♡)
     |
     |  UPDATE SET `response` = _      Changes how the string is encoded (as per #1)
     v
E2 99 A1 [utf8] (♡)                    `response` has an utf8 collation

Technically, since response has an utf8 collation, you could forgo the outermost CONVERT.

ikegami
  • 367,544
  • 15
  • 269
  • 518
  • 1
    Tip: Make a backup first! – ikegami Dec 01 '16 at 18:31
  • before asking the question, I've tried the following on a duplicate table: `ALTER TABLE d MODIFY description TEXT CHARACTER SET latin1; ALTER TABLE d MODIFY description TEXT CHARACTER SET binary; ALTER TABLE d MODIFY description TEXT CHARACTER SET utf8;` but I've discovered that some entries got chopped after the following character `"Ã"`[ _this character was present in the original table, after the ALTER of the table, the entry got chopped after this character_ ]. So I thought that I should better ask what really happens in the background, before continuing. – Nick Dec 01 '16 at 18:40
1

53 7A 61 62 C383 C2B3 is "double-encoded". One cure is

CONVERT(BINARY(CONVERT(CONVERT(UNHEX('537A6162C383C2B3')
        USING utf8) USING latin1)) USING utf8) --> 'Szabó'

Or, more relevantly:

CONVERT(BINARY(CONVERT(CONVERT(BINARY(CONVERT('Szabó' USING latin1))
        USING utf8mb4) USING latin1)) USING utf8mb4) --> 'Szabó'

Or, more simply:

UPDATE tbl SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8);

See this and search for "double".
See this and search for "Perl".
Fixes for various cases.

Community
  • 1
  • 1
Rick James
  • 135,179
  • 13
  • 127
  • 222