
I'm trying to understand UTF-8 in Perl.

I have the following string: Alizéh. If I look up the hex for this string, I get 416c697ac3a968 from https://onlineutf8tools.com/convert-utf8-to-hexadecimal (this matches the original source of the string).

So I thought packing that hex and encoding it to UTF-8 should produce the Unicode string. But it produces something very different.

Is anyone able to explain what I'm getting wrong?

Here is a simple test program to show my working.

#!/usr/bin/perl

use strict;
use warnings;

use Text::Unaccent;
use Encode;

use utf8;
binmode STDOUT, ':encoding(UTF-8)';

print "First test that the utf8 string Alizéh prints as expected\n\n";

print "=========================================== Hex to utf8 test start\n";

my $hexRepresentationOfTheString = '416c697ac3a968';
my $packedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "The hex of the string is $hexRepresentationOfTheString\n";
print "The string after packing prints as $packedHexIntoPlainString\n";
utf8::encode($packedHexIntoPlainString);
print "Utf8 encoding the string produces $packedHexIntoPlainString\n";

print "=========================================== Hex to utf8 test finish\n\n";

print "=========================================== utf8 from code test start\n";
my $utf8FromCode = "Alizéh";
print "Variable prints as $utf8FromCode\n";

my ($hex) = unpack("H*", $utf8FromCode);

print "Hex of this string is now $hex\n";

print "Decoding the utf8 string\n";
utf8::decode($utf8FromCode);

$hex = unpack ("H*", $utf8FromCode);
print "Hex string is now         $hex\n";

print "=========================================== utf8 from code test finish\n\n";

This prints:

First test that the utf8 string Alizéh prints as expected

=========================================== Hex to utf8 test start
The hex of the string is 416c697ac3a968
The string after packing prints as Alizéh
Utf8 encoding the string produces Alizéh
=========================================== Hex to utf8 test finish

=========================================== utf8 from code test start
Variable prints as Alizéh
Hex of this string is now 416c697ae968
Decoding the utf8 string
Hex string is now         416c697ae968
=========================================== utf8 from code test finish

Any tips on how to take the hex value of a UTF-8 string and turn it into a valid UTF-8 scalar in Perl?

There is some further weirdness, which I'll explain in this extended version:

#!/usr/bin/perl

use strict;
use warnings;

use Text::Unaccent;
use Encode;

use utf8;
binmode STDOUT, ':encoding(UTF-8)';

print "First test that the utf8 string Alizéh prints as expected\n\n";

print "=========================================== Hex to utf8 test start\n";

my $hexRepresentationOfTheString = '416c697ac3a968';
my $packedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "The hex of the string is $hexRepresentationOfTheString\n";
print "The string after packing prints as $packedHexIntoPlainString\n";
utf8::encode($packedHexIntoPlainString);
print "Utf8 encoding the string produces $packedHexIntoPlainString\n";

print "=========================================== Hex to utf8 test finish\n\n";

print "=========================================== utf8 from code test start\n";
my $utf8FromCode = "Alizéh";
print "Variable prints as $utf8FromCode\n";

my ($hex) = unpack("H*", $utf8FromCode);

print "Hex of this string is now $hex\n";

print "Decoding the utf8 string\n";
utf8::decode($utf8FromCode);

$hex = unpack ("H*", $utf8FromCode);
print "Hex string is now         $hex\n";

print "=========================================== utf8 from code test finish\n\n";

print "=========================================== Unaccent test start\n";

my $plaintest = unac_string('utf8', "Alizéh");

print "Alizéh passed to the unaccent gives $plaintest\n";


my $cleanpackedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "Packed version of the hex string prints as  $cleanpackedHexIntoPlainString\n";

my $packedtest = unac_string('utf8', $cleanpackedHexIntoPlainString);

print "Unaccenting the packed version gives $packedtest\n";

utf8::encode($cleanpackedHexIntoPlainString);
print "encoding the packed version it now prints as $cleanpackedHexIntoPlainString\n";

$packedtest = unac_string('utf8', $cleanpackedHexIntoPlainString);

print "Now unaccenting the packed version gives $packedtest\n";

print "=========================================== Unaccent test finish\n\n";

This prints:

First test that the utf8 string Alizéh prints as expected

=========================================== Hex to utf8 test start
The hex of the string is 416c697ac3a968
The string after packing prints as Alizéh
Utf8 encoding the string produces Alizéh
=========================================== Hex to utf8 test finish

=========================================== utf8 from code test start
Variable prints as Alizéh
Hex of this string is now 416c697ae968
Decoding the utf8 string
Hex string is now         416c697ae968
=========================================== utf8 from code test finish

=========================================== Unaccent test start
Alizéh passed to the unaccent gives Alizeh
Packed version of the hex string prints as  Alizéh
Unaccenting the packed version gives Alizeh
encoding the packed version it now prints as Alizéh
Now unaccenting the packed version gives AlizA©h
=========================================== Unaccent test finish

In this test it seems that the unaccent library accepts the packed version of the string's hex. I'm not sure why. Could anyone please help me understand why that works?

Dan Walmsley
  • 1
    As a side note, [Text::Unidecode](https://metacpan.org/pod/Text::Unidecode) performs in the same sort of problem space and is in a much more reliable state. – Grinnz Dec 11 '19 at 00:11

1 Answer

Unicode strings are first-class values in Perl; you do not need to jump through these hoops. You just need to keep track of when you have bytes and when you have characters. Perl will not differentiate for you, because every byte string is also a valid character string (each byte is simply treated as a character with a code point from 0 to 255). In your test you are double-encoding: pack already gives you the UTF-8 encoded bytes, and utf8::encode then treats each of those bytes as a character and encodes it again. The result is still a valid string, but it holds the UTF-8 encoding of (the characters corresponding to) your UTF-8 encoded bytes.

use utf8; will decode your source code from UTF-8, so once you declare it, the literal strings that follow are already Unicode strings and can be passed to any API that correctly accepts characters. To get the same from a string of UTF-8 bytes (as you are producing by packing the hex representation of the bytes), use decode from Encode (or my nicer wrapper).
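
To make the double-encoding concrete, here is a minimal sketch that only uses the hex string from the question. The variable names are illustrative, not taken from the original program. It compares the packed bytes before and after utf8::encode:

```perl
use strict;
use warnings;

# The hex of the UTF-8 bytes, as given in the question.
my $bytes      = pack 'H*', '416c697ac3a968';
my $before_len = length $bytes;            # one element per byte
my $before_hex = unpack 'H*', $bytes;

# utf8::encode works in place and treats every byte as a character
# (code points 0-255), so each byte >= 0x80 turns into two bytes.
my $double = $bytes;
utf8::encode($double);
my $after_len = length $double;
my $after_hex = unpack 'H*', $double;

printf "packed:       %d bytes, hex %s\n", $before_len, $before_hex;
printf "after encode: %d bytes, hex %s\n", $after_len,  $after_hex;
```

The two bytes c3 and a9 that made up é are each encoded again (to c3 83 and c2 a9), which is why the string grows from 7 to 9 bytes instead of becoming a 6-character string.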

use strict;
use warnings;
use utf8;
use Encode 'decode';

my $str = 'Alizéh'; # already decoded
my $hex = '416c697ac3a968';
my $bytes = pack 'H*', $hex;
my $chars = decode 'UTF-8', $bytes;
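
As a quick sanity check on the snippet above, the following sketch (variable names such as $round_trip are made up for illustration) confirms that decoding turns 7 bytes into 6 characters, and that encoding the characters again gives back the original hex:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $hex   = '416c697ac3a968';
my $bytes = pack 'H*', $hex;
my $chars = decode('UTF-8', $bytes);

my $byte_len = length $bytes;    # length of the byte string
my $char_len = length $chars;    # length of the character string

# The fifth character should be the single code point for e-acute.
my $e_acute = sprintf 'U+%04X', ord substr($chars, 4, 1);

# Encoding the characters again should reproduce the original bytes.
my $round_trip = unpack 'H*', encode('UTF-8', $chars);

print "bytes: $byte_len, characters: $char_len, fifth character: $e_acute\n";
print "round trip hex: $round_trip\n";
```

Checking length and ord like this is a cheap way to tell whether a given scalar currently holds bytes or decoded characters.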

Unicode strings need to be encoded to UTF-8 for output to something that expects bytes, such as STDOUT; a :encoding(UTF-8) layer can be applied to such handles to do this automatically, and the same to automatically decode from input handles. The exact nature of what should be applied depends entirely on where your characters are coming from and where they are going. See this answer for way too much information on the options available.

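use Encode 'encode';
# either encode explicitly before printing to a handle with no layer
print encode 'UTF-8', "$chars\n";
# or add a layer so the handle encodes everything printed to it
binmode *STDOUT, ':encoding(UTF-8)'; # warning: global effect
print "$chars\n";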
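
A small sketch, assuming an in-memory filehandle stands in for STDOUT, suggests that the explicit encode and the :encoding(UTF-8) layer write the same bytes. The names $buffer, $explicit_hex and $layer_hex are illustrative only:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars = "Aliz\x{e9}h";    # character string; \x{e9} is e-acute

# Option 1: encode explicitly.
my $explicit = encode('UTF-8', $chars);

# Option 2: print the characters through an :encoding(UTF-8) layer.
# An in-memory handle is used here so the written bytes can be inspected.
my $buffer = '';
open my $fh, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
print {$fh} $chars;
close $fh or die "close failed: $!";

my $explicit_hex = unpack 'H*', $explicit;
my $layer_hex    = unpack 'H*', $buffer;

print "explicit: $explicit_hex\n";
print "layer:    $layer_hex\n";
```

Doing both at once (encoding explicitly and then printing through the layer) is what produces double-encoded output, so pick one per handle.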
Grinnz
  • 2
    Perfect. I'd only add that `binmode *STDOUT, ':encoding(UTF-8)'` is one of the things done by `use open ':std', ':encoding(UTF-8)';` – ikegami Dec 10 '19 at 23:53
  • 2
    Re "*`my $str = 'Alizéh'; # already decoded`*", It's already decoded because of `use utf8;`, in case that's not clear. – ikegami Dec 10 '19 at 23:57
  • Thanks. The reason I was doing this was to deal with strings coming back from the database in the wrong encoding. It turns out that they were put in the DB as unicode, but when they are read out the Sybase driver is re-encoding them as unicode. So the solution was to decode the string twice. But this answer helped a lot to figure out what was going wrong. I had the encode and decode the wrong way around. Your explanation was perfect, thanks! – Dan Walmsley Dec 11 '19 at 00:23