4

I have a value in MySQL that contains an apostrophe () and an ellipsis (...):

$ /bin/echo "select alias from url_alias where source = 'node/12024'" | \
  mysql --skip-column-names -D cat36ia_d7prod

Output:

forum/technical-discussion/nagging-questions-i’ve-been-too-embarrassed-ask…

When I retrieve the value with Perl DBI and DBD::mysql, the value has been changed:

$ perl -MDBI -MDBD::mysql -e
      '$dbh=DBI->connect( "DBI:mysql:database=my_db",nick );
       $v=$dbh->selectrow_array(qq|select alias from url_alias where source = "'node/12024'"|);
       print "$v\n";'

Output:

forum/technical-discussion/nagging-questions-i?ve-been-too-embarrassed-ask?

Why is Perl doing this? Can I override it?

ThisSuitIsBlackNot
  • 23,492
  • 9
  • 63
  • 110
  • 1
    This is not an "escaping" issue; it's an issue with characterset translation. By default, Perl uses ASCII for STDIN, STDOUT and STDERR; and yes, you can override that behavior. Also, your database connection may be specifying a characterset that doesn't support the code points. And yes, that can be overridden as well. A couple of articles you should have under your belt: [**The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)**](http://www.joelonsoftware.com/articles/Unicode.html) – spencer7593 Jan 13 '15 at 16:51
  • 1
    And the other article: [**What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text**](http://kunststube.net/encoding/) – spencer7593 Jan 13 '15 at 16:51

3 Answers3

6
  1. Tell Perl how to encode the output.

    use open ':std', ':encoding(UTF-8)';
    
  2. Get the data from the database as text by using

    DBI->connect("DBI:mysql:database=my_db", $user, $pass, {
       mysql_enable_utf8 => 1,
    })
    
ikegami
  • 367,544
  • 15
  • 269
  • 518
3

You probably need to tell DBI to use UTF8 when talking to the database.

$dbh=DBI->connect(
   'DBI:mysql:database=my_db', $user, $pass,
   { mysql_enable_utf8 => 1 }
);
Dave Cross
  • 68,119
  • 3
  • 51
  • 97
2

Q: Why is Perl doing this? Can I override it?

It's not being escaped. That's a symptom of a characterset translation issue. The question mark character is a default character used when a code point doesn't map to any other character in the target characterset.


The short answer, as to why Perl is doing this may be: by default, Perl outputs to STDOUT using ascii characterset. Since ASCII only supports code points up to U+00EF, all other code points (for example, characters 128 thru 255) get translated to a question mark character.

The short answer as to how to override this behavior may be: specify that STDIN, STDOUT and STDERR use utf8 encoding rather than ascii by including a line like this in your perl program:

use open qw(:std :utf8);

Another potential issue is the setting of the MySQL session character_set_client variable; the database connection may be using a latin1 characterset, but the database/server/column characterset may be utf8, so a characterset translation may also be occurring there.

And it's possible to specify the characterset to be used in the database connection, to avoid an unwanted characterset translation.


As a starting point of understanding charactersets, here's two references you should have under your belt:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text

spencer7593
  • 106,611
  • 15
  • 112
  • 140