
In a Webapp I maintain I try to keep everything in UTF-8:

  • the Database (CHARSET=utf8)
  • the source files (use utf8; written in utf8)
  • the templates (for Template Toolkit, using ENCODING => utf8)
  • user input and output (charset=utf8 header in HTTP, binmode :utf8 for STDIN and STDOUT)

But I still need to use Encode::decode('UTF-8', $data) on data coming from the database, or it gets double-encoded or otherwise broken.

Why is this? How can I get rid of this annoying extra step? Shouldn't there be a way to keep everything in UTF-8, all the time, without having to convert anything by hand?
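For reference, the checklist above can be sketched in one place (a hedged sketch; the Template Toolkit ENCODING option is as described in the question, and note that the stricter ':encoding(UTF-8)' layer is generally preferable to the lax ':utf8' layer):

```perl
use strict;
use warnings;
use utf8;                                   # the source file itself is UTF-8

# Decode input and encode output at the program's boundaries.
# ':encoding(UTF-8)' validates the data; ':utf8' merely flags it.
binmode STDIN,  ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

# Template Toolkit: have templates decoded as UTF-8.
use Template;
my $tt = Template->new(ENCODING => 'utf8');

# HTTP: declare the charset so the client decodes the output correctly.
print "Content-Type: text/html; charset=UTF-8\r\n\r\n";
```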

tex

3 Answers


To get UTF-8 from the database properly, you need to enable it explicitly when you connect:

use DBI;

my $dbh = DBI->connect( "dbi:mysql:dbname=$db;host=localhost",
       "user", "pwd", { mysql_enable_utf8 => 1 } );

As I asked in my question here, there are still some problems with it, but in most cases it works fine.

Answering the "why" part is much harder. As Denis pointed out, there was a fairly extensive thread about the "why" recently; it may help you understand the related issues. I suggest using the utf8::all module to make UTF-8 handling easier and cleaner.
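For illustration, a minimal sketch of what utf8::all replaces (the exact set of handles and functions it affects depends on the module version):

```perl
use utf8::all;   # roughly: 'use utf8' plus UTF-8 encoding layers on
                 # STDIN, STDOUT and STDERR (newer versions also cover
                 # @ARGV, open() and readdir)

# No manual binmode or Encode calls needed for ordinary I/O:
print "Müßiggang\n";
```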

w.k

You might find these two threads interesting:

Why does modern Perl avoid UTF-8 by default?

How well does your language support unicode in practice?

Denis de Bernardy

Internally, your database will presumably keep all data in a fixed-width, raw format, usually UCS-4 (i.e. raw strings of 32-bit integers holding one codepoint each). UTF-8 is an encoding, and encodings are only used when serializing data (e.g. to a file or database). Deserializing, i.e. reading, means decoding the encoded data to retrieve the raw codepoint string.

Just because you happen to use the same encoding for all your serialization needs doesn't relieve you of decoding when loading and encoding when writing.
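To illustrate the decode-on-read / encode-on-write boundary in Perl (a sketch with made-up sample data):

```perl
use Encode qw(decode encode);

# Bytes as they arrive from a serialized source (a file, a socket,
# or a driver that does not decode for you):
my $bytes = "caf\xc3\xa9";              # 5 UTF-8 bytes

# Decoding turns them into Perl's internal character string:
my $text = decode('UTF-8', $bytes);     # 4 characters: c a f é

# Encoding turns characters back into bytes for writing out:
my $out = encode('UTF-8', $text);       # the original 5 bytes again
```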

Kerrek SB
    Not true in a mySQL database: Data declared as UTF-8 is stored as UTF-8. – Pekka Jun 26 '11 at 20:19
  • also, the storage may be irrelevant if the access api deals with recoding characters to the client's expectations; that may even be part of the client/server network protocol. – araqnid Jun 26 '11 at 20:25
  • It would be very cumbersome for any framework to maintain an internal string buffer in a particular multibyte encoding ... how would you do string length computations? You couldn't even be sure that the data you read is valid. Perhaps some programs do, but others may not. – Kerrek SB Jun 26 '11 at 20:27
  • "It would be very cumbersome for any framework to maintain an internal string buffer in a particular multibyte encoding" - Ruby 1.9 does exactly that: each string has its own encoding property, and calling its length method deals with it accordingly. – Denis de Bernardy Jun 27 '11 at 05:36