I am writing a Perl script that needs to extract some data from an XML file.
The XML file itself is encoded using UTF-8. For some reason, however, what I extract from the file ends up being encoded as ISO-8859-1. The documentation states that whatever is passed to my handlers should be UTF-8, but it just isn't.
The parser looks roughly like this:
    my $parser = XML::Parser->new( Handlers => {
        # Some unrelated handlers here
        Char => sub {
            my ( $expat, $string ) = @_;
            if ( exists $data->{$curId}{$curField} ) {
                $data->{$curId}{$curField} .= $string;
            } else {
                $data->{$curId}{$curField} = $string;
            }
        } ,
    } );
I have tried the following variants for the actual parsing:
- file parsed directly through $parser->parsefile, no options;
- file parsed directly through $parser->parsefile, with the ProtocolEncoding option;
- file opened using open( $handle , "<file.xml" ) then parsed through $parser->parse;
- file opened using open( $handle , '<:utf8' , "file.xml" ) then parsed through $parser->parse.
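For reference, the four variants above can be put side by side in one small script. This is only a sketch under assumptions: the temporary sample file stands in for file.xml, the Char handler is reduced to collecting text into a single string, the three-argument form of open replaces the two-argument form, and the names %variant and %result are made up for illustration.

```perl
use strict;
use warnings;
use XML::Parser;
use File::Temp qw(tempfile);

# Hypothetical sample file standing in for file.xml: the UTF-8 bytes of "café".
my ( $fh, $file ) = tempfile( SUFFIX => '.xml', UNLINK => 1 );
binmode $fh;
print {$fh} qq{<?xml version="1.0" encoding="utf-8"?>\n<field>caf\xC3\xA9</field>\n};
close $fh or die "close $file: $!";

# Simplified handler: append all character data to one string.
my $text = '';
my $parser = XML::Parser->new( Handlers => {
    Char => sub { my ( $expat, $string ) = @_; $text .= $string; },
} );

# The four parsing variants from the list above.
my %variant = (
    '1 parsefile' => sub { $parser->parsefile($file) },
    '2 parsefile + ProtocolEncoding' => sub {
        $parser->parsefile( $file, ProtocolEncoding => 'UTF-8' );
    },
    '3 open < then parse' => sub {
        open( my $handle, '<', $file ) or die "open $file: $!";
        $parser->parse($handle);
        close $handle;
    },
    '4 open <:utf8 then parse' => sub {
        open( my $handle, '<:utf8', $file ) or die "open $file: $!";
        $parser->parse($handle);
        close $handle;
    },
);

# Run each variant and report what the handler collected.
my %result;
for my $name ( sort keys %variant ) {
    $text = '';
    eval { $variant{$name}->(); 1 } or warn "$name failed: $@";
    $result{$name} = $text;
    printf "%-32s length=%d utf8_flag=%d\n",
        $name, length($text), ( utf8::is_utf8($text) ? 1 : 0 );
}
```

The script reports the character length and Perl's internal UTF-8 flag for each variant, so the variants can be compared without printing any non-ASCII text.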
In addition, I have tried each version with and without the <?xml encoding="utf-8"?> header in the file.
In all cases, what ends up in $data->{$curId}{$curField} is encoded using ISO-8859-1.
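The question does not show how the extracted value is inspected or written out, so the following is only a sketch of one possible check, not a diagnosis. It assumes the handler receives a decoded Perl character string, and compares the bytes produced when that string is printed through a handle with and without an encoding layer. The value "café" and the helper written_bytes are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical value standing in for $data->{$curId}{$curField}:
# the four characters of "café" as a decoded character string.
my $value = "caf\x{e9}";
utf8::upgrade($value);    # mimic a string that carries Perl's UTF-8 flag

# Print a string through an in-memory handle opened with the given mode
# and return the resulting bytes as hex.
sub written_bytes {
    my ( $mode, $string ) = @_;
    my $buffer = '';
    open( my $out, $mode, \$buffer ) or die "in-memory open failed: $!";
    print {$out} $string;
    close $out;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
}

print 'no layer:         ', written_bytes( '>', $value ), "\n";
print ':encoding(UTF-8): ', written_bytes( '>:encoding(UTF-8)', $value ), "\n";
# no layer:         63 61 66 E9
# :encoding(UTF-8): 63 61 66 C3 A9
```

If the real script shows the same pattern, the ISO-8859-1 bytes may come from the output handle rather than from the parser; an explicit layer such as binmode( STDOUT, ':encoding(UTF-8)' ) would be one thing to try.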
What am I doing wrong?