I have a large XML file from a remote source. Its XML declaration says UTF-8, but the file command reports us-ascii.
<?xml version="1.0" encoding="utf-8"?>...
file -bi <file> indicates application/xml; charset=us-ascii
Encode::Guess indicates UTF8
Edit: There is also some code which reads in the file, originally output from an LWP get. I have also tried to force an encoding here, but then I get other errors, such as wide character errors.
use IO::File;
my $fh = IO::File->new;
$fh->open( $filename, '<' ) or die qq(cannot open $filename: $!);
$content = join '', <$fh>;    # slurp the whole file into one string
I am using XML::LibXML::Reader:
my $reader = XML::LibXML::Reader->new( string => $content ) or die qq(cannot read content: $!);
while ( $reader->nextElement( $template->{ 'item' } ) ) {
    my $copy = $reader->copyCurrentNode(1);    # deep copy of the current element
    my $test = $copy->findvalue( 'description' );
    ... # do other stuff with $copy
}
This works fine through most of the contents. However, there appears to be some invalid UTF-8 or otherwise malformed data, because the parser gives an error halfway through.
(Note: XML::Bare processes the whole XML 'fine' as it is more forgiving, but the file is at the limit of available memory, so I need a parser with a smaller memory footprint.)
Entity: line 64070: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0x1A 0x73 0x20 0x73
If I look in vim at the point just after the last successful item, I can see ^Z, which :ascii in vim reports as
<^Z> 26, Hex 1a, Octal 032
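To check whether this is an isolated byte, a scan along the following lines might help. This is only a sketch: find_control_bytes is a hypothetical helper name, and the sample string stands in for the real file contents. It assumes the relevant rule is the XML 1.0 one, which allows only tab (0x09), LF (0x0A) and CR (0x0D) among the characters below 0x20.

```perl
use strict;
use warnings;

# Hypothetical helper: return [line, offset in line, byte value] for every
# control byte below 0x20 other than tab, LF and CR.
sub find_control_bytes {
    my ($bytes) = @_;
    my @hits;
    my $line_no = 0;
    for my $line ( split /\n/, $bytes, -1 ) {
        $line_no++;
        while ( $line =~ /([\x00-\x08\x0B\x0C\x0E-\x1F])/g ) {
            # pos() points just past the match, so step back one byte
            push @hits, [ $line_no, pos($line) - 1, ord $1 ];
        }
    }
    return @hits;
}

# Made-up sample data standing in for $content
my $sample = "<a>ok</a>\n<b>it\x1As broken</b>\n";
printf "line %d, offset %d: 0x%02X\n", @$_ for find_control_bytes($sample);
# prints: line 2, offset 5: 0x1A
```

Running this over $content would show how many such bytes there are and where, which should indicate whether removing them is a realistic option.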
I have looked here on SO for ways to at least ensure valid UTF-8, since I can't get the origin fixed, and I have tried:
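use Encode qw( encode decode );
# decode turns the raw bytes into Perl characters (malformed sequences are substituted)
my $chars = decode('UTF-8', $content, Encode::FB_DEFAULT );
# encode turns the characters back into UTF-8 bytes
$content = encode('UTF-8', $chars, Encode::FB_CROAK );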
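One thing that may explain why the round trip changes nothing: 0x1A is a single-byte ASCII control character, so it is already valid UTF-8 and Encode has no reason to touch it. The sketch below uses a made-up byte string (not data from the real file) to illustrate this. It assumes the \xFF byte stands in for genuinely malformed UTF-8.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Made-up sample: \x1A is a control byte, \xFF can never occur in UTF-8
my $bytes = "it\x1As \xFF here";

# FB_DEFAULT substitutes U+FFFD for malformed input instead of dying
my $chars = decode( 'UTF-8', $bytes, Encode::FB_DEFAULT );
my $clean = encode( 'UTF-8', $chars, Encode::FB_CROAK );

print "0x1A still present\n" if index( $clean, "\x1A" ) >= 0;
print "0xFF was replaced\n"  if index( $clean, "\xFF" ) < 0;
```

If that reading is right, the result is valid UTF-8 but still contains a character that XML 1.0 does not permit, which would match the parser failing at the same place.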
But I still get the same error. I am happy to skip any parts with invalid UTF-8, but the whole parser dies, and I can't see any way to carry on processing later (which I believe is supposed to happen with XML parsing).
My question: is this the best way to guarantee UTF-8 (assuming I can't get the file changed), or is there a method that gets around the error? I could probably regex that particular character out, but I'm assuming there may be other similar issues later, so that feels clunky.
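For reference, the removal approach I have in mind would look roughly like this sketch. Rather than targeting only 0x1A, it drops every control byte below 0x20 except tab, LF and CR, on the assumption that those are the ones XML 1.0 rejects. strip_xml10_controls is a hypothetical helper name, and the sample string is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: delete control bytes below 0x20 other than
# tab (0x09), LF (0x0A) and CR (0x0D). These byte values never occur
# inside a multi-byte UTF-8 sequence, so working on raw bytes is safe.
sub strip_xml10_controls {
    my ($bytes) = @_;
    ( my $clean = $bytes ) =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;
    return $clean;
}

# Made-up sample standing in for the slurped file
my $content = "<item><description>it\x1As fine</description></item>";
$content = strip_xml10_controls($content);

# $content could then be passed on as before:
# XML::LibXML::Reader->new( string => $content )
print "$content\n";
```

This silently drops data, so it only makes sense if losing those bytes is acceptable.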