Perl internal representation of unicode string

Question

I am working on a Perl + Mojolicious web application. My front end sends a POST request containing accented characters in an "a" parameter ("été") with charset UTF-8, as I can see in Chrome's Network tab. But the server-side script appears to decode that parameter using a charset I didn't expect. I wrote the following script to reproduce the case.

use utf8; #script encoded in utf8 without bom
use Mojolicious::Lite; 
use Data::HexDump;
{
    require Mojolicious;
    say "perl $^V, Mojolicious: v", Mojolicious->VERSION, ", ", `chcp` ;
}

post '/' => sub{
        my $self = shift;
        my $params = $self->req->params->to_hash;
        app->log->debug("received data:\n", HexDump( $params->{a} ) );
        use Devel::Peek;
        Dump( $params->{a} );
        $self->render( text => "ok for '$params->{a}'" );
    };

if(my $pid = fork()){
    use Mojo::UserAgent;
    my $t = Mojo::UserAgent->new;
    #simulate front-end query
    my $tx  = $t->post('http://127.0.0.1:3042/' => 
                            { 'Content-Type' => 'application/x-www-form-urlencoded; charset=UTF-8' }, 
                            form => {  a => 'été'} 
                        );
    my $res = $tx->res->body;
    say "result:\n", HexDump($res);
    use Devel::Peek;
    Dump( $res );
    kill 'SIGKILL', $pid;
    exit(0);
}

app->start(qw(daemon --listen http://*:3042 ));

The output of this script was:

perl v5.20.1, Mojolicious: v6.05, Page de codes active : 850

[Tue May 26 12:31:15 2015] [info] Listening at "http://*:3042"
Server available at http://127.0.0.1:3042
[Tue May 26 12:31:16 2015] [debug] Your secret passphrase needs to be changed
[Tue May 26 12:31:16 2015] [debug] POST "/"
[Tue May 26 12:31:16 2015] [debug] Routing to a callback
[Tue May 26 12:31:16 2015] [debug] received data:

          00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F  0123456789ABCDEF

00000000  E9 74 E9                                           .t.

SV = PVMG(0x5a7a198) at 0x4dce730
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x5b62c48 "\303\251t\303\251"\0 [UTF8 "\x{e9}t\x{e9}"]
  CUR = 5
  LEN = 10
[Tue May 26 12:31:16 2015] [debug] 200 OK (0.005052s, 197.941/s)
result:
          00 01 02 03 04 05 06 07 - 08 09 0A 0B 0C 0D 0E 0F  0123456789ABCDEF

00000000  6F 6B 20 66 6F 72 20 27 - C3 A9 74 C3 A9 27        ok for '..t..'

SV = PV(0x41a73e8) at 0x4927070
  REFCNT = 1
  FLAGS = (PADMY,POK,IsCOW,pPOK)
  PV = 0x5aa1328 "ok for '\303\251t\303\251'"\0
  CUR = 14
  LEN = 16
  COW_REFCNT = 1

So we can see that the server receives the "a" parameter as a string flagged UTF8 that contains "\x{e9}t\x{e9}".

I was expecting "été" as the bytes "C3 A9 74 C3 A9".

What is wrong?

After reading these [answers](http://stackoverflow.com/questions/3951722/whats-the-difference-between-unicode-and-utf8), I see that I was confusing Unicode (the standard) with UTF-* (the encodings). But in the Dump above, the UTF8 flag followed by Unicode characters (`[UTF8 "\x{e9}t\x{e9}"]`) still confuses me. — xlat, May 28 '15 at 07:17
In the Devel::Peek output line `PV = 0x5b62c48 "\303\251t\303\251"\0 [UTF8 "\x{e9}t\x{e9}"]`, I did not notice that the octal sequence `\303\251` stands for the bytes C3 A9. — xlat, May 28 '15 at 08:14
Another trap in my code was that `Data::HexDump` was showing me the character semantics of the string rather than the byte semantics of its buffer. To see the real binary data I must do something like `use Devel::Hexdump 'xd';` — xlat, May 28 '15 at 08:16

optional · Accepted Answer · 2015-05-28T00:52:25.117

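Update: There is nothing wrong with your program. You are getting été just as you wanted; it is simply dumped as the Perl Unicode string "\xE9t\xE9", which is the same thing. At the language level, a decoded Perl string is a sequence of Unicode code points (ordinals), not a sequence of UTF-8 bytes. UTF-8 is just one way to encode those code points: é is the ordinal 233, which UTF-8 represents as the two bytes C3 A9. Your Devel::Peek output shows both views: the UTF8 flag means the internal buffer ("\303\251t\303\251") happens to hold the code points in UTF-8 form, while the bracketed part ("\x{e9}t\x{e9}") shows the characters your code actually sees. Check the Wikipedia link below (program also updated).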

Um, été is only C3 A9 74 C3 A9 in UTF-8; as numbers (ordinals), été is 233 116 233,

which as a Perl Unicode string is \xE9t\xE9, since the number 233 is E9 in hex.
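
The bytes-versus-ordinals distinction can be checked without any file or web server. The following is a minimal sketch, not the code Mojolicious itself runs: it assumes only the core `Encode` module, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The five bytes a client sends for "été" when it uses UTF-8.
my $bytes = "\xC3\xA9t\xC3\xA9";

# Decoding turns those bytes into a string of three code points.
my $chars = decode('UTF-8', $bytes);

# One ordinal per character of the decoded string.
my @ordinals = map { ord } split //, $chars;

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
print  "ordinals: @ordinals\n";

# Encoding goes the other way and gives back the original bytes.
my $again = encode('UTF-8', $chars);
print "round trip: ", unpack('H*', $again), "\n";
```

This prints `bytes: 5, characters: 3`, then `ordinals: 233 116 233`, then `round trip: c3a974c3a9`. The decoded string is what a framework normally hands to your code; the encoded form only reappears when you explicitly encode it or write it through an encoding layer.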

Update: previously I created the UTF-8 file `2` with an editor; here it is created with perl. You can see it has the bytes you expect, and `dd` shows the difference between reading it as UTF-8 and reading it raw.

$ perl -CS -e " print chr(233), chr(116), chr(233) " >2

$ od -tx1 2
0000000 c3 a9 74 c3 a9
0000005

$ type 2
été
$
$ perl -MData::Dump -MPath::Tiny -e " dd ( path(2)->slurp_raw ) "
"\xC3\xA9t\xC3\xA9"

$ perl -MData::Dump -MPath::Tiny -e " dd ( path(2)->slurp_utf8 ) "
"\xE9t\xE9"

$ perl -MData::Dump -MPath::Tiny -e " dd( map { [ $_, ord$_ ] } split //, path(2)->slurp_utf8 ) "
(["\xE9", 233], ["t", 116], ["\xE9", 233])

I'm not sure what `path(2)` is supposed to do. I got `Error open (<:unix) on '2': No such file or directory at -e line 1.` with the first one-liner, but if I create a file named `2`, it dumps its content. Sorry, but I don't see how your answer helps me find what's wrong in my code. — xlat, May 27 '15 at 12:13
@xlat, it looks like he placed a BOM and "été" in a UTF-8-encoded file named `2`. — ikegami, May 27 '15 at 12:56
Updated for you: there is nothing wrong with your program; you are getting été like you wanted. — optional, May 28 '15 at 00:53
Thank you guys. I just figured out that my assumption was wrong, but I am still confused by Perl's internal representation of a string flagged as UTF8 that contains Unicode (`\xe9`) rather than UTF-8 (`\xC3\xA9`). So I may rename the question to something like "Perl internal representation of unicode string" and read more manuals... — xlat, May 28 '15 at 06:38

score 2 · Answer 2 · answered May 27 '15 at 14:06

U+00E9 is the code point for é. c3 a9 is the UTF-8 encoding. To see the UTF-8 encoded form of 'é', you need to UTF-8 encode it. For example:

#!/usr/bin/env perl -l

use utf8;
use strict;
use warnings;
use Unicode::UTF8 qw( encode_utf8 );

binmode STDOUT, ':encoding(UTF-8)';

my $é = "\x{e9}";

print $é;
printf "%v02x\n", encode_utf8($é);

Output:

$ ./u.pl
é
c3.a9
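
To relate this to the Devel::Peek output in the question: the UTF8 flag describes how Perl stores the string internally, not which characters it contains. The sketch below is an illustration under that assumption; it uses only the built-in `utf8::` functions (available without loading any module), and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Three characters, each below 0x100, so Perl stores them one byte each.
my $native = "\xE9t\xE9";

# Same three characters, but force the internal buffer into UTF-8 form.
my $wide = $native;
utf8::upgrade($wide);

my $equal       = $native eq $wide        ? 1 : 0;
my $flag_native = utf8::is_utf8($native)  ? 1 : 0;
my $flag_wide   = utf8::is_utf8($wide)    ? 1 : 0;

printf "equal: %d, lengths: %d/%d, UTF8 flag: %d/%d\n",
    $equal, length($native), length($wide), $flag_native, $flag_wide;

# To see UTF-8 bytes, encode a copy explicitly instead of peeking at internals.
my $octets = $wide;
utf8::encode($octets);
my $hex = join ' ', map { sprintf '%02X', ord } split //, $octets;
print "encoded bytes: $hex\n";
```

This prints `equal: 1, lengths: 3/3, UTF8 flag: 0/1` followed by `encoded bytes: C3 A9 74 C3 A9`. Both strings compare equal and have length 3 regardless of the flag, which is why the parameter in the question is already the expected "été" even though a character-level hex dump shows E9 74 E9.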
