Improper UTF-8 and LibXML::Reader

Question

I have a large XML file from a remote source that declares itself as UTF-8, but the file command reports us-ascii.

<?xml version="1.0" encoding="utf-8"?>...

file -bi <file> indicates application/xml; charset=us-ascii
Encode::Guess indicates UTF8

Edit: There is also some code which reads in the file, originally output from an LWP get. I have also tried to force some encoding here, but then I get other errors, such as wide character errors.

my $fh = IO::File->new;
$fh->open( '<' . $filename ) or die qq(cannot open $filename: $!);
$content = join '', <$fh>;    # slurp the whole file into one string

I am using XML::LibXML::Reader

my $reader = XML::LibXML::Reader->new(string => $content) or die qq(cannot read content: $!);

while ($reader->nextElement($template->{ 'item' } )) {
    my $copy = $reader->copyCurrentNode(1);    # deep copy of the current element
    my $test = $copy->findvalue( 'description' );
    ... # do other stuff with $copy
}

This works fine through most of the contents. However, there appears to be some invalid UTF-8 or malformed data, as it gives an error halfway through.
(Note: XML::Bare processes the whole XML 'fine' as it's more forgiving, but the file is at the limit of memory size, so I need an XML parser that uses less memory.)

Entity: line 64070: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0x1A 0x73 0x20 0x73

If I look in vim at the point after last success, I can see

^Z  or <^Z>  26,  Hex 1a,  Octal 032 with :ascii in vim

I have looked here on SO to try to ensure at least valid UTF-8, as I can't get the origin fixed, and am trying...

use Encode qw( encode decode );
my $octets = decode('UTF-8', $content, Encode::FB_DEFAULT );
$content = encode('UTF-8', $octets, Encode::FB_CROAK );

But I still get the same error. I am happy to skip any parts with invalid UTF-8, but the whole parser dies, and I can't see any way to carry on processing later (which I believe is supposed to happen with XML parsing).

My question is: is this the best way to guarantee UTF-8 (assuming I can't get the file changed), or is there a method that should get around the error? (I could probably regex that particular char out, but I'm assuming there may be other similar issues later, so that feels clunky.)

Normally I would expect code like `XML::LibXML::Reader->new(location => "http://example.com/file.xml")`, then the XML parser will take care of parsing and decoding as needed. If you load from a string with `new(string => $content)`, where/how do you create that string when you get the error about improper UTF-8? — Martin Honnen, Aug 08 '16 at 14:23
Tip: `open(my $fh, '<:raw', $qfn) or die $!;` would be better, as it ensures the file is "binary". — ikegami, Aug 08 '16 at 14:50
Tip: `XML::LibXML::Reader->new(IO => $fh)` would make far more sense than loading the entire file into memory. — ikegami, Aug 08 '16 at 14:50
Thanks ikegami, there's a different process that loads the files in which is why it was that way, but I will look to see if we can change that slightly, as reducing memory would be valuable. I will also look at the loading in as binary. Very useful. — Ian, Aug 08 '16 at 15:01
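
The two tips above can be combined roughly as follows. This is only a sketch: the temporary sample file and the element names `item` and `description` are assumptions standing in for the real feed, and it will still stop at an illegal character such as 0x1A.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use XML::LibXML::Reader;

# Hypothetical sample file standing in for the downloaded feed.
my ($out, $filename) = tempfile(SUFFIX => '.xml', UNLINK => 1);
print {$out} qq(<?xml version="1.0" encoding="utf-8"?>\n),
             qq(<items><item><description>first</description></item>),
             qq(<item><description>second</description></item></items>\n);
close $out or die "cannot close $filename: $!";

# ':raw' hands the bytes to libxml2 untouched, so the parser does the decoding.
open(my $fh, '<:raw', $filename) or die "cannot open $filename: $!";

# Reading from the handle avoids holding the whole file in a Perl string.
my $reader = XML::LibXML::Reader->new(IO => $fh)
    or die "cannot create reader for $filename";

my @descriptions;
while ($reader->nextElement('item')) {
    my $copy = $reader->copyCurrentNode(1);    # deep copy of the current <item>
    push @descriptions, $copy->findvalue('description');
}
close $fh;

print "$_\n" for @descriptions;
```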

ikegami · Accepted Answer · 2016-08-08T16:55:51.627

The error message is misleading; the problem has nothing to do with encoding^[1]. In fact, the error I receive is the following^[2]:

:1: parser error : PCDATA invalid Char value 26

From the XML spec,

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

U+001A may not legally appear in XML files, not even as a character reference (&#x1A;).

Characters referred to using character references must match the production for Char.
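
If the source cannot be fixed, one possible workaround is to delete the forbidden control bytes before handing the data to the parser. The following is a minimal sketch, not a complete validator: `strip_illegal_controls` is a hypothetical helper name, the sample string is made up, and only the C0 control range is handled (surrogates, U+FFFE and U+FFFF are not checked).

```perl
use strict;
use warnings;

# Hypothetical helper: delete the C0 control bytes that the Char production
# forbids. Tab (0x09), LF (0x0A) and CR (0x0D) are legal and are kept.
sub strip_illegal_controls {
    my ($bytes) = @_;
    # In UTF-8, bytes below 0x80 never occur inside a multi-byte sequence,
    # so deleting them byte-wise cannot corrupt other characters.
    (my $clean = $bytes) =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;
    return $clean;
}

# Made-up sample resembling the reported bytes (0x1A followed by "s").
my $raw   = "<description>it\x1As fine</description>";
my $clean = strip_illegal_controls($raw);

print $clean, "\n";
```

Deleting the byte loses whatever character it originally stood for, so replacing it with a space or another placeholder may be preferable, depending on the data.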

If the file is to contain binary data, the binary portions should be encoded (e.g. using base64).
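
As an illustration of that last point, here is a small sketch using the core MIME::Base64 module. The `<blob>` element and its `encoding` attribute are invented for the example, and the regex extraction is only there to keep the sketch free of non-core dependencies.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Hypothetical binary payload containing bytes that are illegal in XML text.
my $binary = "\x00\x1A\xFF";

# The empty second argument suppresses the line breaks encode_base64 adds.
my $encoded = encode_base64($binary, '');

# <blob> is a made-up element name used only for this sketch.
my $xml = qq(<blob encoding="base64">$encoded</blob>);

# The consumer reverses the step after extracting the element text.
my ($payload) = $xml =~ m{<blob[^>]*>([^<]*)</blob>};
my $decoded   = decode_base64($payload);

print $decoded eq $binary ? "round trip ok\n" : "round trip failed\n";
```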

[1] 1A, 20 and 73 are all less than 80 (hex), so each is a single ASCII byte and therefore valid UTF-8.
[2] I tested using XML::LibXML rather than XML::LibXML::Reader, but I suspect the different message comes from a different version of XML::LibXML or libxml2 rather than from the choice of module.

Thank you, this has led me in the right direction now with looking at valid xml. — Ian, Aug 08 '16 at 16:03
