
I'm trying to find out how to use Mojo::DOM with UTF8 (and other formats... not just UTF8). It seems to mess up the encoding:

    my $dom = Mojo::DOM->new($html);

    $dom->find('script')->reverse->each(sub {
        #print "$_->{id}\n";
        $_->remove;
    });

    $dom->find('style')->reverse->each(sub {
        #print "$_->{id}\n";
        $_->remove;
    });

    $html = "$dom"; # pass back to $html, now we have cleaned it up...

This is what I get when saving the file without running it through Mojo:

[screenshot: the file's text with its special characters displayed correctly]

...and then once through Mojo:

[screenshot: the same text with the special characters garbled]

FWIW, I'm grabbing the HTML file using Path::Tiny, with:

    my $utf8 = path($_[0])->slurp_raw;

Which, to my understanding, should already have the string decoded into bytes, ready for Mojo?

UPDATE: After brian's suggestion, I looked into how I could figure out the encoding type to decode it correctly. I tried Encode::Guess and a few others, but they seemed to get it wrong on quite a few. This one seems to do the trick:

    my $enc_tmp = `encguess $_[0]`;
    my ($fname,$type) = split /\s+/, $enc_tmp;
    my $decoded = decode( $type||"UTF-8", path($_[0])->slurp_raw );
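As a side note on the Encode::Guess attempt mentioned above, here is a minimal, hedged sketch of using it in-process instead of shelling out to `encguess`; the sample string is an assumption for illustration. `guess_encoding` returns an encoding object on success but a plain error string on an ambiguous or failed guess, and an over-broad suspect list is a common reason it appears to "get it wrong", so the return value is worth checking:

```perl
use v5.10;
use utf8;
use Encode qw(encode);
use Encode::Guess;

# Stand-in for the octets read with slurp_raw
my $octets = encode('UTF-8', 'Copyright © 2022');

# With no extra suspects, only ASCII, UTF-8 and BOM-marked UTF-16/32 are
# considered; an ambiguous guess returns a plain error string instead of
# an encoding object, so check with ref() before using it
my $guess = guess_encoding($octets);
die "could not guess encoding: $guess" unless ref $guess;

my $decoded = $guess->decode($octets);
binmode STDOUT, ':encoding(UTF-8)';
say $decoded;
```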
Andrew Newby

1 Answer


You are slurping raw octets but not decoding them (storing the raw octets in $utf8). Then you treat them as if you had decoded them, so the result is mojibake.

  • If you read raw octets, decode them before you use them. You'll end up with the right Perl internal string.
  • slurp_utf8 will decode for you.
  • Likewise, you have to encode when you output again. The open pragma does that in this example.
  • Mojolicious already has Mojo::File->slurp to get raw octets, so you can reduce your dependency list.
    use v5.10;
    use utf8;

    use open qw(:std :utf8);
    use Path::Tiny;
    use Mojo::File;
    use Mojo::Util qw(decode);

    # Write a known UTF-8 file to demonstrate with
    my $filename = 'test.txt';
    open my $fh, '>:encoding(UTF-8)', $filename;
    say { $fh } "Copyright © 2022";
    close $fh;

    say "===== Path::Tiny::slurp_raw, no decode";
    say path($filename)->slurp_raw;

    say "===== Path::Tiny::slurp_raw, decode";
    say decode( 'UTF-8', path($filename)->slurp_raw );

    say "===== Path::Tiny::slurp_utf8";
    say path($filename)->slurp_utf8;

    say "===== Mojo::File::slurp, decode";
    say decode( 'UTF-8', Mojo::File->new($filename)->slurp );

The output:

    ===== Path::Tiny::slurp_raw, no decode
    Copyright Â© 2022

    ===== Path::Tiny::slurp_raw, decode
    Copyright © 2022

    ===== Path::Tiny::slurp_utf8
    Copyright © 2022

    ===== Mojo::File::slurp, decode
    Copyright © 2022
brian d foy
  • Thank you :) Although, I'm still having fun. The issue is that the documents are grabbed from websites (so can be iso-8859-1, utf16, utf8, etc). So this works: `decode( 'UTF-8', path($filename)->slurp_raw );` , but seeing as other pages are not utf8 - that would break it, no? – Andrew Newby Jun 29 '22 at 09:09
  • annoyingly, this almost works: `my $decoded = decode( 'Detect', path($_[0])->slurp_raw );`, but it messes up the UTF8 again (shows as `Copyright 漏 2019`). Not sure if it's the site in question that's the issue (telfordrepro.co.uk) or the Perl code still. – Andrew Newby Jun 29 '22 at 09:22
  • @AndrewNewby the website (most likely) told you the correct encoding when you retrieved the file. If you use Perl to fetch it then you can have it automatically decode using the encoding from the server (e.g. `->text` if using the response from a Mojo::UserAgent, or `->decoded_content` if using LWP). But if you saved the contents to disk without the headers, you threw away that information, which is why you now need to "guess" it. – hobbs Jun 29 '22 at 13:03
  • @hobbs thanks. We are actually using CLI curl to grab it (via backtick system commands), so it should grab it correctly. The issue was actually around how we read it (and decoded it). I ended up using `encguess` as part of the script to look at the file's guessed encoding, and then use that. I'm sure there will be cases where that doesn't work as expected, but it seems to be doing the job so far :) – Andrew Newby Jun 30 '22 at 05:17
  • @AndrewNewby it's not that curl doesn't "grab it correctly", it's the way you're using curl: it's saving the contents without also saving the headers that tell you how to decode without guessing. – hobbs Jun 30 '22 at 15:17
  • @hobbs so how do you do that with the CLI version? My understanding is that it gets "saved" in the document? (When I open the document in Sublime Text or Notepad++ it shows the correct encoding.) – Andrew Newby Jul 01 '22 at 06:04
  • If you have new questions, please ask new questions. – brian d foy Jul 01 '22 at 15:25
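For the CLI-curl workflow discussed in the comments, here is a hedged sketch of what hobbs describes: have curl save the response headers (its `-D` flag) alongside the body, then read the charset from the Content-Type header instead of guessing. The URL, filenames, and sample content below are placeholders; a real run would use the commented-out curl command rather than the stand-in files written here.

```perl
use v5.10;
use utf8;
use Encode qw(decode encode);
use Path::Tiny;

# Real run: have curl keep the headers it would otherwise discard
# system 'curl', '-s', '-D', 'headers.txt', '-o', 'body.html', $url;

# Stand-in files simulating what such a run would save
path('headers.txt')->spew_raw(
    "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=ISO-8859-1\r\n\r\n");
path('body.html')->spew_raw( encode('ISO-8859-1', 'Copyright © 2022') );

# Read the declared charset from the saved headers, fall back to UTF-8
my ($charset) = path('headers.txt')->slurp_raw =~ /charset=([\w-]+)/i;
my $decoded   = decode( $charset // 'UTF-8', path('body.html')->slurp_raw );

binmode STDOUT, ':encoding(UTF-8)';
say "$charset: $decoded";
```

When the program fetches the page itself, letting the HTTP client decode (e.g. `->text` on a Mojo::UserAgent response, as hobbs notes) avoids the header bookkeeping entirely.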