I'm trying to understand UTF-8 in Perl.
I have the following string: Alizéh. If I look up the hex for this string, I get 416c697ac3a968 from https://onlineutf8tools.com/convert-utf8-to-hexadecimal (this matches the original source of the string).
So I thought that packing that hex and encoding it to UTF-8 should produce the Unicode string, but it produces something very different.
Is anyone able to explain what I'm getting wrong?
Here is a simple test program to show my working.
#!/usr/bin/perl
use strict;
use warnings;
use Text::Unaccent;
use Encode;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';
print "First test that the utf8 string Alizéh prints as expected\n\n";
print "=========================================== Hex to utf8 test start\n";
my $hexRepresentationOfTheString = '416c697ac3a968';
my $packedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "The hex of the string is $hexRepresentationOfTheString\n";
print "The string after packing prints as $packedHexIntoPlainString\n";
utf8::encode($packedHexIntoPlainString);
print "Utf8 encoding the string produces $packedHexIntoPlainString\n";
print "=========================================== Hex to utf8 test finish\n\n";
print "=========================================== utf8 from code test start\n";
my $utf8FromCode = "Alizéh";
print "Variable prints as $utf8FromCode\n";
my ($hex) = unpack("H*", $utf8FromCode);
print "Hex of this string is now $hex\n";
print "Decoding the utf8 string\n";
utf8::decode($utf8FromCode);
$hex = unpack ("H*", $utf8FromCode);
print "Hex string is now $hex\n";
print "=========================================== utf8 from code test finish\n\n";
This prints:
First test that the utf8 string Alizéh prints as expected
=========================================== Hex to utf8 test start
The hex of the string is 416c697ac3a968
The string after packing prints as Alizéh
Utf8 encoding the string produces Alizéh
=========================================== Hex to utf8 test finish
=========================================== utf8 from code test start
Variable prints as Alizéh
Hex of this string is now 416c697ae968
Decoding the utf8 string
Hex string is now 416c697ae968
=========================================== utf8 from code test finish
Any tips on how to take the hex value of a UTF-8 string and turn it into a valid UTF-8 scalar in Perl?
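One direction that may be worth checking is sketched below. It assumes that what pack returns is already a string of UTF-8 bytes, so the missing step would be decoding those bytes into characters rather than encoding them a second time. The variable names are only illustrative, and this is a guess at the right direction, not a confirmed fix.

```perl
use strict;
use warnings;

# Hex of the UTF-8 bytes for "Alizéh" (the value from the question).
my $hex = '416c697ac3a968';

# pack yields a byte string: 7 bytes, where c3 and a9 are still two separate bytes.
my $bytes = pack('H*', $hex);

# Decode a copy in place: UTF-8 bytes become Perl characters.
# utf8::decode returns false if the bytes are not valid UTF-8.
my $chars = $bytes;
utf8::decode($chars) or die "input is not valid UTF-8\n";

# Encode characters back to UTF-8 only at the output boundary.
binmode STDOUT, ':encoding(UTF-8)';
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
print "$chars\n";
# prints:
# bytes: 7, characters: 6
# Alizéh
```

Under that assumption the byte string has length 7 and the decoded string has length 6, because the two bytes c3 a9 collapse into the single character U+00E9.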
There is some further weirdness, which I'll explain with this extended version:
#!/usr/bin/perl
use strict;
use warnings;
use Text::Unaccent;
use Encode;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';
print "First test that the utf8 string Alizéh prints as expected\n\n";
print "=========================================== Hex to utf8 test start\n";
my $hexRepresentationOfTheString = '416c697ac3a968';
my $packedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "The hex of the string is $hexRepresentationOfTheString\n";
print "The string after packing prints as $packedHexIntoPlainString\n";
utf8::encode($packedHexIntoPlainString);
print "Utf8 encoding the string produces $packedHexIntoPlainString\n";
print "=========================================== Hex to utf8 test finish\n\n";
print "=========================================== utf8 from code test start\n";
my $utf8FromCode = "Alizéh";
print "Variable prints as $utf8FromCode\n";
my ($hex) = unpack("H*", $utf8FromCode);
print "Hex of this string is now $hex\n";
print "Decoding the utf8 string\n";
utf8::decode($utf8FromCode);
$hex = unpack ("H*", $utf8FromCode);
print "Hex string is now $hex\n";
print "=========================================== utf8 from code test finish\n\n";
print "=========================================== Unaccent test start\n";
my $plaintest = unac_string('utf8', "Alizéh");
print "Alizéh passed to the unaccent gives $plaintest\n";
my $cleanpackedHexIntoPlainString = pack("H*", $hexRepresentationOfTheString);
print "Packed version of the hex string prints as $cleanpackedHexIntoPlainString\n";
my $packedtest = unac_string('utf8', $cleanpackedHexIntoPlainString);
print "Unaccenting the packed version gives $packedtest\n";
utf8::encode($cleanpackedHexIntoPlainString);
print "encoding the packed version it now prints as $cleanpackedHexIntoPlainString\n";
$packedtest = unac_string('utf8', $cleanpackedHexIntoPlainString);
print "Now unaccenting the packed version gives $packedtest\n";
print "=========================================== Unaccent test finish\n\n";
This prints:
First test that the utf8 string Alizéh prints as expected
=========================================== Hex to utf8 test start
The hex of the string is 416c697ac3a968
The string after packing prints as Alizéh
Utf8 encoding the string produces Alizéh
=========================================== Hex to utf8 test finish
=========================================== utf8 from code test start
Variable prints as Alizéh
Hex of this string is now 416c697ae968
Decoding the utf8 string
Hex string is now 416c697ae968
=========================================== utf8 from code test finish
=========================================== Unaccent test start
Alizéh passed to the unaccent gives Alizeh
Packed version of the hex string prints as Alizéh
Unaccenting the packed version gives Alizeh
encoding the packed version it now prints as Alizéh
Now unaccenting the packed version gives AlizA©h
=========================================== Unaccent test finish
In this test it seems that the unaccent library accepts the packed version of the string's hex. I'm not sure why; could anyone please help me understand why that works?
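To make the three states easier to compare, here is a small sketch that dumps the code points of the packed string, the string after utf8::encode, and the string after utf8::decode. The helper dump_hex is a hypothetical name introduced only for this illustration. My working assumption, which I have not confirmed, is that unac_string('utf8', ...) expects raw UTF-8 bytes, which would explain why the packed string works and the re-encoded one does not.

```perl
use strict;
use warnings;

# Hypothetical helper (not a library function): hex of each character's code point.
sub dump_hex {
    my ($string) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $string;
}

# UTF-8 bytes for "Alizéh", as produced by pack in the test program.
my $packed = pack('H*', '416c697ac3a968');

# utf8::encode treats every byte as a Latin-1 character and encodes it again,
# so c3 becomes c3 83 and a9 becomes c2 a9 (double encoding).
my $encoded = $packed;
utf8::encode($encoded);

# utf8::decode turns the UTF-8 bytes into characters, so c3 a9 becomes e9.
my $decoded = $packed;
utf8::decode($decoded);

print "packed : ", dump_hex($packed),  "\n";
print "encoded: ", dump_hex($encoded), "\n";
print "decoded: ", dump_hex($decoded), "\n";
# prints:
# packed : 41 6c 69 7a c3 a9 68
# encoded: 41 6c 69 7a c3 83 c2 a9 68
# decoded: 41 6c 69 7a e9 68
```

If the assumption holds, the packed line shows exactly the bytes the unaccent call handled correctly, and the encoded line shows the doubled sequence that produced AlizA©h.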