I have a script for reading HTML files in Perl. It works, but it breaks the encoding.

This is my script:

use utf8;
use Data::Dumper;

open my $fr, '<', 'file.html' or die "Can't open file $!";
my $content_from_file = do { local $/; <$fr> };

print Dumper($content_from_file);

Content of file.html:

<span class="previews-counter">Počet hodnotení: [%product.rating_votes%]</span>
<a href="#" title="[%L10n.msg('Zobraziť recenzie')%]" class="previews-btn js-previews-btn">[%L10n.msg('Zobraziť recenzie')%]</a>

Output from reading:

<span class=\"previews-counter\">Po\x{10d}et hodnoten\x{ed}: [%product.rating_votes%]</span>
<a href=\"#\" title=\"[%L10n.msg('Zobrazi\x{165} recenzie')%]\" class=\"previews-btn js-previews-btn\">[%L10n.msg('Zobrazi\x{165} recenzie')%]</a>

As you can see, a lot of characters are escaped. How can I read this file and show its content as it is?

tomsk
  • Did you `use Data::Dumper` somewhere? – ernix Nov 17 '18 at 12:45
  • @ernix yes, why? – tomsk Nov 17 '18 at 12:46
  • I've just tried running your script. Try `open my $fr, '<:raw', 'file.html' or die "Can't open file $!";`. It seems your script modifies the I/O layer so that everything read via open is decoded as UTF-8. See https://perldoc.perl.org/PerlIO.html and https://perldoc.perl.org/perlopentut.html – ernix Nov 17 '18 at 12:51
  • Start by reading http://perldoc.perl.org/perluniintro.html – Shawn Nov 17 '18 at 12:51
  • I also usually suggest using [File::Slurper](https://metacpan.org/pod/File::Slurper) for reading a file into a string. – Shawn Nov 17 '18 at 12:54
  • Data::Dumper is escaping high-bit characters and metacharacters in your string. This is a feature of Data::Dumper, which is designed for debugging. Don't use Data::Dumper if you want to output the actual string. – Grinnz Nov 17 '18 at 18:02
  • I recommend reading https://perlgeek.de/en/article/encodings-and-unicode – tinita Nov 17 '18 at 18:34

1 Answer

You open the file with perl's default encoding:

open my $fh, '<', ...;

If that encoding doesn't match the actual encoding, Perl might translate some characters incorrectly. If you know the encoding, specify it in the open mode:

open my $fh, '<:utf8', ...;

You aren't done yet, though. Now that you have a properly decoded string, you want to output it, and you have the same problem again: the standard output filehandle's encoding has to match what you are printing to. If you've set up your terminal (or whatever) to expect UTF-8, you need to actually output UTF-8. One way to do that is to make the standard filehandles use UTF-8:

use open qw(:std :utf8);

You have `use utf8`, but that only tells Perl that your program's source file is encoded in UTF-8. It does nothing for the filehandles you read from or write to.
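Putting the two halves together, a minimal sketch might look like the following. It assumes the input file really is UTF-8 and that the terminal expects UTF-8. The helper name `slurp_utf8` is made up for illustration, and the sketch uses the stricter `:encoding(UTF-8)` layer rather than `:utf8`.

```perl
use strict;
use warnings;
use utf8;    # only declares that this source file is UTF-8

# Hypothetical helper: read a whole file, decoding UTF-8 bytes
# into Perl characters on the way in.
sub slurp_utf8 {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Can't open $path: $!";
    local $/;    # slurp mode: read everything in one go
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Encode characters back to UTF-8 bytes on the way out.
binmode STDOUT, ':encoding(UTF-8)';

# Assumed default file name, taken from the question.
my $file = @ARGV ? $ARGV[0] : 'file.html';
print slurp_utf8($file) if -e $file;
```

The decoding happens once on input and the encoding once on output. In between, the program works with characters, not bytes.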

I've written a much longer primer for Perl and Unicode in the back of Learning Perl. The StackOverflow question Why does modern Perl avoid UTF-8 by default? has lots of good advice.

brian d foy
  • To be more specific: the standard output file handle's encoding has to match what you are trying to print **to**. Normally, this is a terminal that expects UTF-8. Don't use the `:utf8` layer, use `:encoding(UTF-8)`. – Grinnz Nov 17 '18 at 18:00
  • I changed the open statement to `open my $fr, '<:encoding(UTF-8)', $file or die "Can't open file $!";`, added `use open qw(:std :utf8);` at the beginning, and added `binmode STDOUT, ':encoding(UTF-8)';` before `Dumper`, but it is still broken. – tomsk Nov 17 '18 at 18:16
  • It's not broken. It's a feature of Data::Dumper. @Grinnz already told you to not use Data::Dumper. Why don't you just print the string? Is there any reason why you are using Data::Dumper other than debugging? – tinita Nov 17 '18 at 18:35
  • @tinita Oh, then it's OK. It's just Dumper that breaks it, so it's solved :) – tomsk Nov 17 '18 at 19:21
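
To illustrate the point the comments arrive at, here is a small sketch comparing `Dumper` with a plain `print` on the same decoded string. The sample text is taken from the question's file. The variable names are illustrative, and the exact escape format may vary between Data::Dumper versions and settings.

```perl
use strict;
use warnings;
use utf8;    # the string literal below is written in UTF-8
use Data::Dumper;

binmode STDOUT, ':encoding(UTF-8)';

my $text = 'Počet hodnotení';

# Dumper builds a debugging representation as Perl source code,
# so wide characters such as "č" appear as escapes like \x{10d}.
my $dumped = Dumper($text);
print $dumped;

# A plain print sends the characters themselves through the
# encoding layer, so the terminal shows the original text.
print "$text\n";
```

The string is identical in both cases. Only the representation chosen by `Dumper` differs, which is why changing the I/O layers did not change what the question's script displayed.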