I am completely lost with encoding issues. I have no idea what is going on, what exactly the problem is, or how to fix it.
Basically, I'm just trying to read an HTML file from a compressed archive (a .tar.gz file), parse it, and then output pieces of it to XML. Something funky is happening with the text I get out of the parser.
When parsing the HTML, I get á instead of a space, but only if I write to the screen. If I keep the text in a variable and write it to a file, it looks fine in the file. However, even though it looks right in the XML, something is wrong with it: my PHP parser can't parse that XML, and IE doesn't seem to like it either.
I first had to run mb_convert_encoding($xmlcontent, "ASCII"); so I could get that XML to parse in PHP.
Any idea what my problem is?
Extract the HTML from a .tar.gz file using Perl:

    use Archive::Tar;

    my $tar = Archive::Tar->new;
    $tar->read("myfile.tar.gz");
    $tar->extract_file('index.html', 'output.html');

Load the HTML. This is where it starts to get funky; I get output like Numberáofásourceálines:

    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file('output.html') or die $!;
    $tree->elementify;

Write to XML:

    use IO::File;
    use XML::Writer;

    my $output = IO::File->new(">output.xml") or die "Cannot open output.xml: $!";
    my $writer = XML::Writer->new(OUTPUT => $output, DATA_MODE => 1, DATA_INDENT => 2);
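
To narrow down where the á comes from, it may help to look at the actual code points and bytes rather than at what the screen shows. The following is a minimal diagnostic sketch, not part of the pipeline above. It assumes (this is a guess, not something confirmed here) that the parser returns a no-break space, U+00A0, for each `&nbsp;` in the HTML, and that the screen is a console using code page 437, where byte 0xA0 is drawn as á. The helpers `code_points` and `hex_bytes` are hypothetical names introduced only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: show every character of a string as U+XXXX,
# so that invisible characters such as U+00A0 become visible.
sub code_points {
    my ($text) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $text;
}

# Hypothetical helper: show every byte of an encoded string in hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# Assumption: the parsed text contains a no-break space, not a plain space.
my $text = "a\x{A0}b";

print 'code points:   ', code_points($text), "\n";
print 'Latin-1 bytes: ', hex_bytes(encode('ISO-8859-1', $text)), "\n";
print 'UTF-8 bytes:   ', hex_bytes(encode('UTF-8', $text)), "\n";

# Reading the single Latin-1 byte 0xA0 as code page 437 yields U+00E1 (á).
print 'A0 as cp437:   ', code_points(decode('cp437', "\xA0")), "\n";
```

If running `code_points` on the text from the tree shows U+00A0 where a space is expected, the characters are no-break spaces, and the remaining question is which encoding the XML file is written in and which encoding its declaration claims.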