
I am completely lost with encoding issues. I have no idea what's going on, what exactly the problem is, or how to fix it.

Basically, I'm just trying to read an HTML file from a compressed archive (a .tar.gz file), parse it, then output pieces of it to XML. Something funky is happening with the text I get out of the parser.

When parsing the HTML, I get á instead of a space, but only when I write to the screen. If I keep the text in a variable and write it to a file, it looks fine in the file. However, even though the XML looks right, something is wrong with it: my PHP parser can't parse it, and IE doesn't seem to like it either.

To get that XML to parse in PHP, I first had to run mb_convert_encoding($xmlcontent, "ASCII");

Any idea what my problem is?

  1. extract HTML from a .tar.gz file using Perl

    use Archive::Tar;

    my $tar = Archive::Tar->new;
    $tar->read("myfile.tar.gz") or die $tar->error;
    $tar->extract_file('index.html', 'output.html') or die $tar->error;
    
  2. load HTML; this is where it starts to get funky, and I get output like Numberáofásourceálines

    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file('output.html') or die $!;
    $tree->elementify;
    
  3. write to XML

    use IO::File;
    use XML::Writer;

    my $output = IO::File->new("output.xml", "w") or die $!;
    my $writer = XML::Writer->new(OUTPUT => $output, DATA_MODE => 1, DATA_INDENT => 2);
    
daxim
user391986
  • Is your data multi-byte Unicode? The strange character smells like it might be. Use a proper character encoding on it. – Rasika Jun 16 '11 at 23:53
  • You first need to know the encoding of the input data. Then convert it to UTF-8. – hakre Jun 17 '11 at 00:26
  • How do I determine the encoding? The file is generated automatically by a tool. I have no control over it; I can only process it. – user391986 Jun 17 '11 at 00:44
  • Does the HTML have a meta tag defining the character set used? Often it will. – DavidO Jun 17 '11 at 04:58
  • Show the HTML file. Upload it unchanged somewhere, or provide a hex dump of it. – daxim Jun 17 '11 at 08:15

2 Answers


If it looks correct when you write it to a file and wrong when you write it to the terminal, it sounds like your terminal is expecting the wrong encoding. Check your terminal settings.
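
One way to see how a terminal mismatch could produce exactly this symptom is the sketch below. It assumes (this is a guess, not something stated in the question) that the odd character is a non-breaking space stored as the single byte 0xA0, and that the terminal uses a DOS code page such as CP437. The same byte then means two different characters depending on who interprets it:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the "space" in the data is really the byte 0xA0.
my $byte = "\xA0";

# Interpreted as ISO-8859-1, 0xA0 is U+00A0 (NO-BREAK SPACE).
my $as_latin1 = decode('iso-8859-1', $byte);

# Interpreted as CP437 (a common DOS console code page), 0xA0 is U+00E1 (a-acute).
my $as_cp437 = decode('cp437', $byte);

printf "ISO-8859-1: U+%04X\n", ord $as_latin1;   # prints "ISO-8859-1: U+00A0"
printf "CP437:      U+%04X\n", ord $as_cp437;    # prints "CP437:      U+00E1"
```

If that assumption holds, the file looks fine in an editor that reads it as Latin-1, while a CP437 console shows á for every such byte.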

Also, see Jon Rockway's answer to "Why does modern Perl avoid UTF-8 by default?". With encodings, you have to decode your input from the encoding it is actually in and encode your output to the encoding its consumer expects. Everything that looks at the data needs to know which encoding you're using.
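
A minimal sketch of that decode-on-input, encode-on-output pattern, using the core Encode module. The sample bytes and the choice of UTF-8 are assumptions for illustration; the real input encoding has to be determined from the file itself:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the word "cafe" with an e-acute, as 5 UTF-8 bytes.
my $input_bytes = "caf\xC3\xA9";

# Decode once at the input boundary: bytes become a Perl character string.
my $text = decode('UTF-8', $input_bytes);

# Inside the program, work with characters, not bytes.
my $char_count = length $text;           # 4 characters

# Encode once at the output boundary: characters become bytes again.
my $output_bytes = encode('UTF-8', $text);
my $byte_count   = length $output_bytes; # 5 bytes

print "characters: $char_count, bytes: $byte_count\n";
```

For file handles, the same idea can be expressed with I/O layers such as `open my $fh, '<:encoding(UTF-8)', $path`, so the decoding happens as the data is read.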

brian d foy

I think I just fixed it by running this substitution on the HTML before parsing it. Thanks for all the great pointers!

    s/&nbsp;/ /g;
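
An alternative sketch, in case editing the raw HTML is not desirable: leave the markup alone and normalise the text after it has been extracted. This assumes the parser has already turned each `&nbsp;` into the character U+00A0; the sample string is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical text as it might come out of the parser,
# with U+00A0 (NO-BREAK SPACE) where ordinary spaces were expected.
my $text = "Number\x{A0}of\x{A0}source\x{A0}lines";

# Copy the string, then translate every U+00A0 into a plain space.
(my $clean = $text) =~ tr/\x{A0}/ /;

print "$clean\n";   # prints "Number of source lines"
```

Either approach only hides the non-breaking spaces; writing the output with an explicit encoding is still needed if other non-ASCII characters can appear.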
user391986