I am trying to read an HTML file with the Perl module File::Slurp:

binmode STDOUT, ':utf8';
my $htmlcontent = read_file($file, {binmode => ':utf8'});

But when I print the $htmlcontent variable, some characters are not displayed correctly, namely French accented letters and other special characters.

For example: "Plus d'actualit\u00e9s" should be "Plus d'actualités"

I also checked the encoding of the file, and it looks fine:

HTML document, UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators

Is there a problem with this module?

Thanks

Jim Davis

1 Answer

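\u00e9 is not a UTF-8 character; it is the JavaScript escape sequence for a Unicode character (here U+00E9, "é"). The file itself is valid UTF-8, but it contains that escape as literal text, so reading it as UTF-8 leaves the escape untouched. You need to decode the content of the file separately, for example with Encode::JavaScript::UCS.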
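As a rough illustration of what such decoding involves, here is a minimal sketch that uses only core Perl and a regular expression instead of Encode::JavaScript::UCS. The helper name decode_js_escapes is hypothetical, and the sketch assumes every \uXXXX sequence in the text really is an escape (it does not treat an escaped backslash specially).

```perl
use strict;
use warnings;

# Hypothetical helper: replaces each JavaScript-style \uXXXX escape
# with the Unicode character it names.
sub decode_js_escapes {
    my ($text) = @_;

    # Surrogate pairs first: two escapes that together encode one
    # character outside the Basic Multilingual Plane.
    $text =~ s/\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})/
        chr(0x10000 + ((hex($1) - 0xD800) << 10) + (hex($2) - 0xDC00))/gex;

    # Then every remaining single \uXXXX escape.
    $text =~ s/\\u([0-9a-fA-F]{4})/chr(hex($1))/ge;

    return $text;
}

binmode STDOUT, ':utf8';

# Sample string taken from the question; q{} keeps the backslash literal.
my $raw = q{Plus d'actualit\u00e9s};
print decode_js_escapes($raw), "\n";
```

With the sample string from the question, this prints "Plus d'actualités". In the original script, the same call could be applied to $htmlcontent right after read_file.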

Denis Ibaev
  • I tried your solution but still have the same problem. I tested on another machine and the problem disappeared, so I think it is a problem with the OS environment. –  Jun 01 '15 at 11:01