My program receives UTF-8 encoded strings from a data source. I need to tamper with these strings, then output them as part of an XML structure. When I serialize my XML document, it will be double encoded and thus broken. When I serialize only the root element, it will be fine, but of course lacking the header.
Here's a piece of code trying to visualize the problem:
use strict; use diagnostics; use feature 'unicode_strings';
use utf8; use v5.14; use encoding::warnings;
binmode(STDOUT, ":encoding(UTF-8)"); use open qw( :encoding(UTF-8) :std );
use XML::LibXML
# Simulate actual data source with a UTF-8 encoded file containing '¿Üßıçñíïì'
open( IN, "<", "./input" ); my $string = <IN>; close( IN ); chomp( $string );
$string = "Value of '" . $string . "' has no meaning";
# create example XML document as <response><result>$string</result></response>
my $xml = XML::LibXML::Document->new( "1.0", "UTF-8" );
my $rsp = $xml->createElement( "response" ); $xml->setDocumentElement( $rsp );
$rsp->appendTextChild( "result", $string );
# Try to forward the resulting XML to a receiver. Using STDOUT here, but files/sockets etc. yield the same results
# This will not warn and be encoded correctly but lack the XML header
print( "Just the root document looks good: '" . $xml->documentElement->serialize() . "'\n" );
# This will include the header but wide chars are mangled
print( $xml->serialize() );
# This will even issue a warning from encoding::warnings
print( "The full document looks mangled: '" . $xml->serialize() . "'\n" );
Spoiler 1: Good case:
<response><result>Value of '¿Üßıçñíïì' has no meaning</result></response>
Spoiler 2: Bad case:
<?xml version="1.0" encoding="UTF-8"?><response><result>Value of '¿ÃÃıçñÃïì' has no meaning</result></response>
The root element and its contents are already UTF-8 encoded. XML::LibXML accepts the input and is able to work on it and output it again as valid UTF-8. As soon as I try to serialize the whole XML document, the wide characters inside get mangled. In a hex dump, it looks as if the already UTF-8 encoded string gets passed through a UTF-8 encoder again. I've searched, tried and read a lot, from Perl's own Unicode tutorial all the way through tchrist's great answer to the Why does modern Perl avoid UTF-8 by default? question. I don't think this is a general Unicode problem, though, but rather a specific issue between me and XML::LibXML.
What do I need to do to be able to output a full XML document including the header so that its contents remain correctly encoded? Is there a flag/property/switch to set?
(I'll gladly accept links to the corresponding part(s) of TFM that I should have R for as long as they are actually helpful ;)