2

My program receives UTF-8 encoded strings from a data source. I need to tamper with these strings, then output them as part of an XML structure. When I serialize my XML document, it will be double encoded and thus broken. When I serialize only the root element, it will be fine, but of course lacking the header.

Here's a piece of code trying to visualize the problem:

use strict; use diagnostics;    use feature 'unicode_strings';
use utf8;   use v5.14;      use encoding::warnings;
binmode(STDOUT, ":encoding(UTF-8)");    use open qw( :encoding(UTF-8) :std );
use XML::LibXML;

# Simulate actual data source with a UTF-8 encoded file containing '¿Üßıçñíïì'
open( IN, "<", "./input" ); my $string = <IN>; close( IN ); chomp( $string );
$string = "Value of '" . $string . "' has no meaning";

# create example XML document as <response><result>$string</result></response>
my $xml = XML::LibXML::Document->new( "1.0", "UTF-8" );
my $rsp = $xml->createElement( "response" );    $xml->setDocumentElement( $rsp );
$rsp->appendTextChild( "result", $string );

# Try to forward the resulting XML to a receiver. Using STDOUT here, but files/sockets etc. yield the same results
# This will not warn and be encoded correctly but lack the XML header
print( "Just the root document looks good: '" . $xml->documentElement->serialize() . "'\n" );
# This will include the header but wide chars are mangled
print( $xml->serialize() );
# This will even issue a warning from encoding::warnings
print( "The full document looks mangled: '" . $xml->serialize() . "'\n" );

Spoiler 1: Good case:

<response><result>Value of '¿Üßıçñíïì' has no meaning</result></response>

Spoiler 2: Bad case:

<?xml version="1.0" encoding="UTF-8"?><response><result>Value of '¿ÃÃıçñíïì' has no meaning</result></response>

The root element and its contents are already UTF-8 encoded. XML::LibXML accepts the input and is able to work on it and output it again as valid UTF-8. As soon as I try to serialize the whole XML document, the wide characters inside get mangled. In a hex dump, it looks as if the already UTF-8 encoded string gets passed through a UTF-8 encoder again. I've searched, tried and read a lot, from Perl's own Unicode tutorial all the way through tchrist's great answer to the Why does modern Perl avoid UTF-8 by default? question. I don't think this is a general Unicode problem, though, but rather a specific issue between me and XML::LibXML.

What do I need to do to be able to output a full XML document including the header so that its contents remain correctly encoded? Is there a flag/property/switch to set?

(I'll gladly accept links to the corresponding part(s) of TFM that I should have R for as long as they are actually helpful ;)

Olfan
  • Note, `use open qw( :encoding(UTF-8) :std );` already does `binmode(STDOUT, ":encoding(UTF-8)");` – ikegami Jan 13 '14 at 17:43

3 Answers

5

ikegami is correct, but he didn't really explain what's wrong. To quote the docs for XML::LibXML::Document:

IMPORTANT: unlike toString for other nodes, on document nodes this function returns the XML as a byte string in the original encoding of the document (see the actualEncoding() method)!

(serialize is just an alias for toString)

When you print a byte string to a filehandle that has an :encoding layer, Perl treats each byte as an ISO-8859-1 character and encodes it. Since your string already contains UTF-8 bytes, it ends up double encoded.
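The effect can be reproduced without XML::LibXML at all. The following is only a minimal sketch using the core Encode module; the variable names are made up for illustration, and the second encode call stands in for what the :encoding(UTF-8) layer does to a byte string on output.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the bytes of a string as space-separated hex (illustrative helper).
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# A character string holding U+00DC (the 'Ü' from the question),
# written as an escape so this source file stays plain ASCII.
my $chars = "\x{DC}";

# Roughly what Document->toString hands back: UTF-8 encoded bytes.
my $bytes = encode('UTF-8', $chars);

# What the output layer then does: each byte is taken as a
# Latin-1 character and encoded to UTF-8 a second time.
my $double = encode('UTF-8', $bytes);

print 'encoded once:  ', hex_dump($bytes),  "\n";   # C3 9C
print 'encoded twice: ', hex_dump($double), "\n";   # C3 83 C2 9C
```

The four-byte sequence in the second line corresponds to the mangled characters shown in the question's bad case.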

As ikegami said, use binmode(STDOUT) to remove the encoding layer from STDOUT. You could also decode the result of serialize back into characters before printing it, but that assumes the document is using the same encoding you have set on your output filehandle. (Otherwise, you'll emit an XML document whose actual encoding doesn't match what its header claims.) If you're printing to a file instead of STDOUT, open it with '>:raw' to avoid double encoding.
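Both options might look roughly like the sketch below. It is an illustration under assumptions, not the asker's actual code: the file name response.xml and the sample text are invented here, and the decode step is only safe when the document's encoding matches the layer on the output handle.

```perl
use strict;
use warnings;
use Encode qw(decode);
use XML::LibXML;

# Build a small document like the one in the question.
# "\x{131}" is the dotless i from the question's sample input.
my $doc  = XML::LibXML::Document->new('1.0', 'UTF-8');
my $root = $doc->createElement('response');
$doc->setDocumentElement($root);
$root->appendTextChild('result', "Value of '\x{131}' has no meaning");

# On a document node, toString returns bytes in the document's encoding.
my $xml_bytes = $doc->toString();

# Option 1: write the bytes through a handle without an encoding layer.
# 'response.xml' is a hypothetical file name.
open(my $out, '>:raw', 'response.xml') or die "Cannot open response.xml: $!";
print {$out} $xml_bytes;
close($out) or die "Cannot close response.xml: $!";

# Option 2: decode back to characters, then print through a handle
# that carries an encoding layer matching the document's encoding.
my $xml_chars = decode($doc->actualEncoding(), $xml_bytes);
binmode(STDOUT, ':encoding(UTF-8)');
print $xml_chars;
```

Option 1 keeps the XML as an opaque byte stream, which avoids any mismatch between the header and the real encoding. Option 2 is convenient when the same handle also carries ordinary text, at the cost of the caveat described above.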

cjm
  • That's the exact point in the documentation that I missed, thank you. So I'm putting a mix of binary and string objects through my output channels which of course can't work. I'll try to fix my code accordingly. – Olfan Jan 15 '14 at 10:59
  • Thanks for the hint that multi-encoding is not lossy so that I can encode twice and just decode the superfluous encoding runs without damaging my output. – Olfan Jan 15 '14 at 11:07
  • He didn't suggest that you double-encode or that doing so isn't lossy. (Depends on the encoding whether it is or not.) He simply said that double-encoding was occurring. He did suggest that you could encode-decode-encode, which can result in corrupt XML docs. (Depends on the encodings whether it will or not.) – ikegami Jan 15 '14 at 13:51
3

Since XML documents are parsed without needing any external information, they are binary files rather than text files.

You're telling Perl to encode anything sent to STDOUT[1], but then you proceed to output an XML document to it. You can't apply a character encoding to a binary file as it corrupts it.

Replace

binmode(STDOUT, ":encoding(UTF-8)");

with

binmode(STDOUT);

Note: This assumes the rest of the text you are outputting is just temporary debugging information. The output doesn't otherwise make sense.


  1. In fact, you do this twice: once with `use open qw( :encoding(UTF-8) :std );` and then again with `binmode(STDOUT, ":encoding(UTF-8)");`.
ikegami
  • @cjm, Added an explanation. – ikegami Jan 13 '14 at 17:49
  • That seems to nail it, yes. Actually, knowing this doesn't make my life too much easier because now I need to find out exactly when to switch my output channels to raw mode and when to reverse that again. But the question as asked is hereby answered, for which I thank you a whole lot. – Olfan Jan 15 '14 at 11:03
  • Why are you mixing text and binary files? Don't expect weird things to be easier than normal things. – ikegami Jan 15 '14 at 13:49
  • Are you sure it's better to output a potentially wrong header rather than none? – ikegami Jan 15 '14 at 13:55
  • This whole issue is only about communication between individual modules of the same application. Incoming data is converted to UTF-8, outgoing data is encoded to whatever is requested, but internally it's UTF-8 for everyone with no exception. Playing with the code I now find it easier to immediately decode the XML again instead of fiddling with the output encoding. – Olfan Jan 16 '14 at 10:27
  • Re "This whole issue is only about communication between individual modules": You communicate between modules by printing to STDOUT? Because the problem is that you're printing to STDOUT incorrectly. – ikegami Jan 16 '14 at 12:23
  • I agree with not fiddling with the output encoding. You shouldn't be doing that. You should set it once at the start. You're surely doing something wrong if you're mixing text and binary. – ikegami Jan 16 '14 at 12:27
  • I just didn't want to add unnecessary complexity to the example code. The application that had the problem uses various databases, a message queue, files, sockets and modules written in several other languages than Perl, all of which is irrelevant to the problem at hand, so I just chose STDOUT to make the example code as compact as possible. I never intended to "mix text and binary". I was surprised by libxml2 returning binary data which led me to ask my question in the first place. Now that I know it does, it all makes sense and I can work with the results I get. – Olfan Jan 17 '14 at 10:38
  • The file handle name isn't relevant to my point. Could just as easily have been `$socket` instead of `STDOUT`. – ikegami Jan 17 '14 at 13:49
0

I do not like changing the settings of STDOUT just because "toString()" behaves differently in the two modules XML::LibXML::Document and XML::LibXML::Element. I prefer to add "Encode::encode" only where it is required. You can run the following example:

use strict;
use warnings FATAL => 'all';
use XML::LibXML;
use Encode;    # load explicitly; Encode::encode is called below

my ( $doc, $main, $nodelatin, $nodepolish );
$doc = XML::LibXML::Document->createDocument( '1.0', 'UTF-8' );

$main = $doc->createElement('main');
$doc->addChild($main);

$nodelatin = $doc->createElement('latin');
$nodelatin->appendTextNode('Lorem ipsum dolor sit amet');
$main->addChild($nodelatin);

print __LINE__, ' ', $doc->toString();                            # printed OK          
print __LINE__, ' ', $doc->documentElement()->toString(), "\n\n"; # printed OK

$nodepolish = $doc->createElement('polish');
$nodepolish->appendTextNode('Zażółć gęślą jaźń');
$main->addChild($nodepolish);

print __LINE__, ' ', $doc->toString();                            # printed OK
print __LINE__, ' ', Encode::encode("UTF-8", $doc->documentElement()->toString()), "\n"; # printed OK
print __LINE__, ' ', $doc->documentElement()->toString(), "\n";   # Wide character in print