1

i'm a little confused and could't find an answer. So i hope to find help here.

I have a little mysql table:

CREATE TABLE `datatable` (
  `Artikelnummer` varchar(10) NOT NULL,
  `Bezeichnung` varchar(25) NOT NULL,
  `Länge` mediumint(6) DEFAULT NULL,
  `Breite` mediumint(6) DEFAULT NULL,
  `Höhe` mediumint(6) DEFAULT NULL,
  `Vö-Datum` date DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

INSERT INTO `datatable` 
(`Artikelnummer`, `Bezeichnung`, `Länge`, `Breite`, `Höhe`, `Vö-Datum`)
VALUES ('123456', 'Hängematte', 12, 20, 35, '2020-08-31');

ALTER TABLE `datatable`
  ADD PRIMARY KEY (`Artikelnummer`);

and my perl test script is :

#!/usr/bin/perl

use warnings;
use strict;
use utf8;

use CGI qw (:standard);
use DBI;
use Data::Dumper;
use DBIx::Log4perl;

binmode(STDOUT, ':utf8');

my $dbh = '';

if ($dbh = DBIx::Log4perl->connect("DBI:mysql:test","user","password",{
            RaiseError => 1,
            PrintError => 1,
            mysql_enable_utf8 => 1
        }))
{
    $dbh->do('SET NAMES utf8');
    $dbh->do('SET CHARSET utf8');

    my $sql_query
        = 'SELECT * FROM datatable WHERE Artikelnummer = ?';
    my $out = $dbh->prepare($sql_query);
    $out->execute( "123456" )
        or die 'Select-Fehler: '.$dbh->errstr();
    my $Content = $out->fetchrow_hashref();
    $out->finish();

    # Test 1
    if ($Content->{'Bezeichnung'} eq 'Hängematte') {
        print "Test 1: Content Hängematte found!\n";
    }

    # Test 2
    if (defined $Content->{'Höhe'}) {
        print "Test 2: Key Höhe found!\n";
    } else {
        print "Test 2: Key Höhe not found!\n";
    }
    
    # Hack: Hardcore BadHack!!!
    $Content->{'Höhe'} = $Content->{'Höhe'};
    $Content->{'Länge'} = $Content->{'Länge'};
    $Content->{'Vö-Datum'} = $Content->{'Vö-Datum'};

    # Test 3
    if (defined $Content->{'Höhe'}) {
        print "Test 3: Key Höhe found!\n";
    } else {
        print "Test 3: Key Höhe not found!\n";
    }

    print Dumper($Content);
    
} else {
    print "Verbindungsfehler: " . $dbh->errstr(); 
}
exit(0);

And the output is:

Log4perl: Seems like no initialization happened. Forgot to call init()?
Test 1: Content Hängematte found!
Test 2: Key Höhe not found!
Test 3: Key Höhe found!
$VAR1 = {
          'Vö-Datum' => '2020-08-31',
          'Länge' => 12,
          'Höhe' => 35,
          'Höhe' => 35,
          'Breite' => 20,
          'Länge' => 12,
          'Bezeichnung' => "H\x{e4}ngematte",
          'Vö-Datum' => '2020-08-31',
          'Artikelnummer' => '123456'
        };

First, please ignore the Log4perl warning, because it's only a example code to demonstrate my problem. ;-)

My question is, how i get the right encoded hash-keys without this bad hack in the example or other asked why i get the table rows not utf8 encoded in my hash?

I use perl 5.25, DBIx::Log4perl 0.26, DBI 1.642 and mySQL server 5.7.31

Arties
  • 190
  • 4
  • 11
  • See "Mojibake" in https://stackoverflow.com/questions/38363566/trouble-with-utf8-characters-what-i-see-is-not-what-i-stored – Rick James Aug 06 '20 at 18:49
  • `ö` is Mojibake for `ö`. Hex `e4` is the `latin1` encoding for `ä`. Don't mix latin1 and utf8. – Rick James Aug 06 '20 at 18:54
  • Please provide `SELECT Bezeichnung, HEX(Bezeichnung) ...` so we can see what was `INSERTed` for one of the accented cases. For example, `ä` will show hex of `C3A4` if properly encoded into utf8/utf8mb4; `C383C2A4` will indicate "double encoding". – Rick James Aug 06 '20 at 19:32
  • Also, please provide `SHOW GLOBAL VARIABLES LIKE char%';` – Rick James Aug 06 '20 at 19:36
  • @Rick James, Re "*Please provide `SELECT Bezeichnung, HEX(Bezeichnung)`*", The problem isn't what was inserted; that's correct. Specifically, `"H\x{e4}ngematte"` is correct; it's `Hängematte` as Unicode Code Points, which is exactly what's desired and expected. It's the column names the OP is having problems with and asked for help with – ikegami Aug 06 '20 at 19:43
  • @ikegami - I would like to see the hex, not the Perl representation. – Rick James Aug 06 '20 at 19:46
  • @Rick James, To what end? It's correct in Perl, and that's not what the OP is asking about – ikegami Aug 06 '20 at 19:50
  • @RickJames SELECT Bezeichnung, HEX(Bezeichnung) result is "H\x{e4}ngematte", '48C3A46E67656D61747465' and SHOW GLOBAL VARIABLES LIKE "char%"; show as result latin1, that explain the result. But then seems SET CHARSET utf8 not to work? – Arties Aug 06 '20 at 22:25
  • Note that `ä` is stored as 2 UTF-8 bytes `C3A4` in the table. That is correct based on the `CREATE TABLE`. Not to check the parameters in effect during the `SELECT`. The `SET NAMES utf8` should have overridden the `latin1` that you see. The `SET CHARSET` is a mystery; I have never(?) seen it used. It _seems_ set 2 of the 3 things that `SET NAMES` set. – Rick James Aug 06 '20 at 23:47
  • "To what end" -- I don't trust client software to clearly say what, exactly, is in the table. I have found that `SELECT HEX(..)` "never lies", and lets me conclude whether the `INSERT` worked as expected. – Rick James Aug 06 '20 at 23:51
  • @Rick James. I agree with using `SELECT HEX()` to look at data; I said there's no point in `SELECT HEX(Bezeichnung)` since that has nothing to do with the problem or the question. – ikegami Aug 07 '20 at 03:33
  • @Rick James, Re "*`SET CHARSET` is a mystery*" It does 2 of the three things that `SET NAMES` already does. It's completely redundant here. – ikegami Aug 07 '20 at 03:35
  • @Arties, Re "*But then seems SET CHARSET utf8 not to work?*", It did work. You got UTF-8 just as you requested, and `mysql_enable_utf8` decoded the string *values* as you requested – ikegami Aug 07 '20 at 03:39
  • ikegami, but than should "SET CHARTSET utf8" without "mysql_enable_utf8" also work, or not? – Arties Aug 07 '20 at 06:56
  • Without `mysql_enable_utf8`, no decoding will happen, so you'll end up with `'Vö-Datum' => '2020-08-31', 'Bezeichnung' => 'Hängematte'`, and you'll have to decode both col names and values instead of just col names. (Perfectly fine thing to do) – ikegami Aug 07 '20 at 15:30
  • okay, thank - i understand! – Arties Aug 09 '20 at 17:44

1 Answers1

5

For starters, let's make it clear that this is not a problem with the database itself. The problem occurs even with correctly formed tables such as the following:

mysql> DESCRIBE MyTable;
+-------+-------------+------+-----+---------+-------+
| Field | Type        | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| Höhe  | varchar(10) | NO   |     | NULL    |       |
+-------+-------------+------+-----+---------+-------+
1 row in set (0.00 sec)

The problem is entirely on the receiving end.


By default, DBD::mysql returns every string as encoded as per the client character set (SET NAMES) in effect.

One can use mysql_enable_utf8 to override that to some extent. Specifically, it does the following:

  • Decodes data retrieved from a textual column type (char, varchar, etc).
  • Executes SET NAMES utf8 or equivalent (only when mysql_enable_utf8 is set as part of connect).

That means it's up to you to:

  • Encode everything sent to the database.[1]
  • Decode any other strings received from the databases.

That latter includes column names. You can use the following to decode the hash returned by fetchrow_hashref.

sub _d { my ($s) = @_; utf8::decode($s); $s }

sub decode_keys {
   my ($hash) = @_;
   return { map { _d($_) => $hash->{$_} } keys(%$hash) };
}

Reproducing the problem and demonstrating the fix:

use 5.014;
use warnings;

use utf8;                             # Source saved as UTF-8
use open ':std', ':encoding(UTF-8)';  # Terminal expects UTF-8

use Data::Dumper qw( Dumper );
use DBI          qw( );

sub _e { my ($s) = @_; utf8::encode($s); $s }
sub _d { my ($s) = @_; utf8::decode($s); $s }

sub decode_keys { { map { _d($_) => $_ } keys(%{ $_[0] }) } }

my $host     = ...;
my $db       = ...;
my $user     = ...;
my $password = ...;

my $dbh = DBI->connect(
   "dbi:mysql:$db;host=$host",
   $user, $password,
   {
      RaiseError => 1,
      PrintError => 0,
      PrintWarn  => 1,
      mysql_enable_utf8 => 1,  # Or mysql_enable_utf8mb4 => 1
   },
);

$dbh->do(_e('
   CREATE TEMPORARY TABLE `MyTable` (
      `Höhe` VARCHAR(10) NOT NULL
   ) ENGINE=MyISAM DEFAULT CHARSET=utf8
'));

$dbh->do(_e('
   INSERT INTO `MyTable`
      SET `Höhe`="Höhe"
'));

my $rows = $dbh->selectall_arrayref(
   _e('SELECT * FROM `MyTable`'),
   { Slice => {} },
);

{
   local $Data::Dumper::Useqq = 1;
   print(Dumper($rows));
}

for my $row (@$rows) {
   for my $col_name (keys(%$row)) {
      say "$col_name: $row->{$col_name}";
   }
}

$_ = decode_keys($_) for @$rows;

{
   local $Data::Dumper::Useqq = 1;
   print(Dumper($rows));
}

for my $row (@$rows) {
   for my $col_name (keys(%$row)) {
      say "$col_name: $row->{$col_name}";
   }
}

Output:

$VAR1 = [
          {
            "H\303\266he" => "H\x{f6}he"
          }
        ];
Höhe: Höhe
$VAR1 = [
          {
            "H\x{f6}he" => "H\x{f6}he"
          }
        ];
Höhe: Höhe

  1. Because of a bug in the module, you can often get away without encoding these.
ikegami
  • 367,544
  • 15
  • 269
  • 518