Work around the perl DBD::mysql UTF-8 bug

Question

We have a software written in perl which retrieves data from a mysql database. For this we use the DBD::mysql interface

We can retrieve all data correctly, the db is UTF8MB4 and the perl application uses UTF-8.

The code to retrieve the sql result is:

use utf8;
use encoding 'utf8';

...
my $dsn = "DBI:mysql:database=mydatabase;mysql_enable_utf8=1";
my $dbh = DBI->connect($dsn, $userid, $password, { mysql_enable_utf8 => 1 } ) or die $DBI::errstr;

...

my $sth = $dbh->prepare("SELECT addressid, 
                            company, firstname, lastname, 
                            address, zip, city, country,
                            phone, mobile, home,
                            speeddial_phone, speeddial_mobile, speeddial_home,
                            fax, email
                        FROM address
                        WHERE (firstname like ? or lastname like ? or company like ?)
                        LIMIT $sizeLimit
                        ");
$sth->execute(  $searchExpression, $searchExpression, $searchExpression) or die $DBI::errstr;

As long as the $searchExpression is containing normal characters, it works fine. But as soon as we query with special characters notin ASCII, for example é ö ä ü and similar, we don't get a empty result set back.

According to this post, this is due to a bug in the dbd::mysql drivers before version 4.041_01

http://blogs.perl.org/users/mike_b/2016/12/dbdmysql-all-your-utf-8-bugs-are-belong-to-us.html

I have tested different things, but to no avail.

I did turn on request logging in the mysql server and there I see that the parameters with special characters are coming in in a wrong encoding.

Here the output of the mysql log to file:

Time                 Id Command    Argument
180905  9:17:06   403 Connect   inno-ldap-db@localhost on phonebook_innovaphone
                  403 Query     SELECT addressid,
                                company, firstname, lastname,
                                address, zip, city, country,
                                phone, mobile, home,
                                speeddial_phone, speeddial_mobile, speeddial_home,
                                fax, email
                            FROM address
                            WHERE companyid='1' and (firstname like 'andrÃ©%' or lastname like 'andrÃ©%' or company like 'andrÃ©%' )
                            LIMIT 25
                  403 Quit

Since we can't currently upgrade the system (it's debian 7, which includes only older packages like 4.021-1+deb7u3) I would need to implement a work around the problem.

Either some magic to pre-encode/decode the parameters, or would the odbc driver perhaps not experience this bug?

Does '$dsn' contain 'mysql_enable_utf8=1'? I'm not sure the bug you're pointing to is your problem, as that's about DBD::mysql not converting 'latin1' variables to 'utf8'; but $searchExpression should be already utf-8 encoded because of 'use utf8;' at the top. Also, are you sure you retrieve the utf-8 data correctly? Try with fields containing not just western/latin1 characters (ä,ü,etc) but also ą,ă, ш, я. — , Sep 04 '18 at 18:12
I did specify the mysql_enable_utf8=1 in the connect, but I have now also added it to the $dns, with no differences. I now also added a row with the characters mentioned above, they are retrieved correctly. It seems that only the direction perl->mysql has problems. (I don't insert/update data from perl, only query it or as query filter parameter) — André Schild, Sep 05 '18 at 07:07
What does a 'warn "utf8 ", utf8::is_utf8($searchExpression) ? "on\n" : "off\n";' just before the $sth->execute line says? — , Sep 05 '18 at 08:42
I've just tried on a older debian wheezy (libdbd-mysql-perl_4.021-1+deb7u3, libdbi-perl_1.622-1+deb7u1) but wasn't able to reproduce. Sorry. The only thing that I can think of is that the bug is elsewhere, and $searchExpression was already doubly encoded to utf8 before being passed to $sth->execute; but you probably checked that already. — , Sep 05 '18 at 12:59
@mosvy Thanks, that was the hint I needed. It's now working. See answer for the solution. — André Schild, Sep 05 '18 at 14:08

score 3 · Answer 1 · edited Mar 10 '23 at 14:00

To elaborate a bit more: DBD::mysql cannot know what encoding the server expects for bind parameters. Additionally, it has the same unicode bug as many CPAN modules have: it ignores the storage encoding that perl uses and just looks at the internal bits and bytes, so the same string in Perl will come out sometimes correctly, sometimes wrongly, which is the root of the confusion that many people have: the results don't make sense, because the internal encoding of strings is not something that is normally exposed on the perl level.

The arguably correct way to deal with this would be to ask the user to mark every argument as binary (for blob columns) and something else - this is what DBD::MariaDb does - it assumes everything is text unless you override it, and therefore, you won't run into the same issue with it, at the expense of having to do extra work for binary values.

You can deterministically deal with this in DBD::mysql as well, and here is how:

When connecting, enable the mysql_enable_utf8mb4 flag/attr. You should not do it later, but the sequence in the question should work as well. You also should use a charset of utf8mb4 for your database/tables/text columns.

This should reliably deal with data read from the database, as mysql/mariadb will correctly flag text as text (including its encoding) and binary as binary.

The problem left is what to do when submitting data. That's easy, when submitting text, make sure it is in unicode (which means it is not encoded in utf-8, as Perl can represent unicode characters directly). Then you can either upgrade (does not change the meaning of the string in perl, or encode the string to utf8 (which changes the meaning in perl):

# when you have a unicode text string
utf8::upgrade $text_column; # do this
utf8::encode $text_column; # OR that

Either of these will ensure that the internal representation of the Perl string is compatible with utf-8, which is what the database expects. The first one is likely preferable, as it keeps working if the database driver ever is fixed (or when you switch to DBD::MariaDb)

An alternative, if your data is already encoded in utf-8, is to downgrade:

# when you have an utf-8 encoded binary string
utf8::downgrade $text_column;

For blob/binary columns, you need to ensure that Perl's internal representation is not in utf-8. You can ensure this by using utf8::downgrade:

# when you have BLOB data
utf8::downgrade $blob_column;

This also does not change the meaning of the string in Perl, but since DBD::mysql does not care about what Perl thinks, it will do the right thing and submit the data in binary.

Unfortunately, this is no future-compatible version - if you switch to another non-broken database driver, you might have to deal with BLOB data in a different way. One such way would be to mark the parameter as SQL_BLOB and do the downgrade, which should work with correct drivers.

Now for some history.

The confusion in Perl and CPAN modules is, in my personal opinion, rooted in a few different things. First, when implementing unicode for the first time in perl, people realised that the unicode model they had first envisioned for Perl did not work well, and wanted to change it. Unfortunately, the Camel book to document Perl was already updated, so they didn't change it, to stay compatible with the book, with some half-hearted fixes. As a result, perl was no longer fully compatible with the book, and also not fully compatible with the correct model. This has improved over the last years, but it took a long time, and left many victims along the road.

Second, many manual pages, such as perluniintro, are outright wrong, and perpetuate the wrong assumptions that perl knows about the encoding of its character strings, or that the so-called "utf8-flag" has somehting to do with the string being in unicode and/or utf-8. Neither is correct.

And lastly, the XS API was incompatibly changed: To get at the characters in a string in pre-unicode perls, you would call a function called SvPV, which simply returns a pointer to the characters, all of which could be stored in one byte at that time, i.e. binary

Instead of keeping it this way, newer versions of perl kept returning the internal byte representation of the string, so modules that have not been updated will get the raw data, without having the chance to interpret it correctly, because SvPV does not return enough information to decode the data.

Thus, over night, all Perl modules dealing with string data acquired the Perl Unicode Bug(tm), and not all of them have been fixed. And fixing them now might means breaking programs that somehow worked around the issues.

score 1 · Answer 2 · answered Sep 05 '18 at 14:08

It turned out, that the string (received via Net::LDAP::Server) was already in some kind of utf8 encoding, and then the mysql driver did encode it once more.

Solved the problem by adding this code

use Encode qw( decode );
my $decoded = eval { decode('UTF-8', $encoded, Encode::FB_CROAK) }

Code taken from this post: The proper way of encoding detection in perl

Thanks for the hint about double encoding to mosvy

score 0 · Answer 3 · answered Jan 19 '23 at 03:41

To make perl works with utf8 pages and mysql collate you need to set utf8mb4 on mysql fields, set the attribute mysql_enable_utf8mb4 on DBI connection and do the sql query "SET NAMES utf8mb4" after the database connection.

#!/usr/bin/perl
print "Content-type: text/html; charset=UTF-8\n\n";

#use utf8;
#use open ':utf8';
#binmode STDOUT, ":utf8";
#binmode STDIN , ":utf8";
#use encoding 'utf8';

our $dbh = DBI->connect("DBI:mysql:database=$database;host=$servername;port=$port",$username,$password, {PrintWarn => 0, PrintError => 0, mysql_enable_utf8mb4 => 1}) || die;
$dbh->do("SET NAMES utf8mb4");
...

Work around the perl DBD::mysql UTF-8 bug

3 Answers3