I have a problem with Perl output: the French word "préféré" is sometimes output as "pr�f�r�".

The sample script :

devel@k0:~/tmp$ cat 02.pl    
#!/usr/bin/env perl

use strict;
use warnings;

print "préféré\n";  

open( my $fh, '<:encoding(UTF-8)', 'text' ) ;

while ( <$fh> ) { print $_ }

close $fh;

exit;

The execution :

devel@k0:~/tmp$ ./02.pl 
préféré
pr�f�r�
devel@k0:~/tmp$ cat text
préféré
devel@k0:~/tmp$ file text
text: UTF-8 Unicode text

Can someone please help me?

amir
    Add `use utf8;` to your script if it's encoded in utf-8. – Shawn Aug 12 '22 at 16:46
    And look into the `use open` pragma to adjust the encoding of STDOUT. – Shawn Aug 12 '22 at 16:47
    This isn't a dup of the linked question. The linked question only addresses one of the two problems in the OP. – ikegami Aug 13 '22 at 20:09
  • "_This isn't a dup of the linked question_" -- Seconding that, this isn't a dupe of the given link. And what is this "bot" thing: did a program decide that this is a dupe? It "looked" similar enough, aye? The titles have a small Levenshtein distance? (Besides, an answer given here is far superior to the mere listing of methods on the linked page. But how can a bot see that...) – zdim Aug 14 '22 at 00:58

2 Answers

Decode your inputs, encode your outputs. You have two bugs related to failure to properly decode and encode.

Specifically, you're missing

use utf8;
use open ":std", ":encoding(UTF-8)";

Details follow.


Perl source code is expected to be ASCII (with 8-bit clean string literals) unless you add `use utf8;` to tell Perl it's UTF-8.

I believe you have a UTF-8 terminal. We can conclude from the fact that `cat 02.pl` displays correctly that your source code is encoded using UTF-8. This means Perl sees the equivalent of this:

print "pr\x{C3}\x{A9}f\x{C3}\x{A9}r\x{C3}\x{A9}\n";   # C3 A9 = é encoded using UTF-8

You should be using use utf8; so Perl sees the equivalent of

print "pr\x{E9}f\x{E9}r\x{E9}\n";                     # E9 = Unicode Code Point for é

You correctly decode the file you read.

The file presumably contains

70 72 C3 A9 66 C3 A9 72 C3 A9 0A     # préféré␊ encoded using UTF-8

Because of the encoding layer you add, you are effectively doing

$_ = decode( "UTF-8", "\x{70}\x{72}\x{C3}\x{A9}\x{66}\x{C3}\x{A9}\x{72}\x{C3}\x{A9}\x{0A}" );

or

$_ = "pr\x{E9}f\x{E9}r\x{E9}\n";

This is correct.
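
As a minimal sketch of the difference between the encoded and decoded forms (the variable names `$bytes` and `$chars` are illustrative, not from the question), the same word has a different length before and after decoding:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "préféré" as the UTF-8 bytes found in the file (without the newline).
my $bytes = "pr\xC3\xA9f\xC3\xA9r\xC3\xA9";

# The same text as a string of Unicode code points.
my $chars = decode( "UTF-8", $bytes );

print length($bytes), "\n";   # 10: each é occupies two bytes
print length($chars), "\n";   # 7: each é is a single character
```

Functions such as `length`, `substr`, and regex matches only behave as expected on the decoded form, which is why inputs should be decoded as early as possible.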


Finally, you fail to encode your outputs.
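
To see what that failure looks like at the byte level, here is a small sketch. The helper `printed_hex` is hypothetical, written only for this illustration; it prints a string to an in-memory handle and returns the bytes that were written, in hex. It assumes no default layers are in effect (no `use open` pragma, no `-C` switch, no `PERL_UNICODE` setting).

```perl
use strict;
use warnings;

# Hypothetical helper: print $string to an in-memory handle opened with
# $mode, then return the bytes that ended up in the buffer as hex.
sub printed_hex {
    my ( $mode, $string ) = @_;
    my $buffer = '';
    open( my $out, $mode, \$buffer ) or die "Cannot open in-memory handle: $!";
    print {$out} $string;
    close($out) or die "Cannot close in-memory handle: $!";
    return unpack( 'H*', $buffer );
}

my $decoded = "pr\x{E9}f\x{E9}r\x{E9}";   # "préféré" as 7 code points

# No encoding layer: each é is written as the single byte E9.
print printed_hex( '>', $decoded ), "\n";                   # 7072e966e972e9

# With an encoding layer: each é is written as the UTF-8 bytes C3 A9.
print printed_hex( '>:encoding(UTF-8)', $decoded ), "\n";   # 7072c3a966c3a972c3a9
```

A lone E9 byte is not valid UTF-8, so a UTF-8 terminal shows it as �. The string literal in the original script only appeared to work because its undecoded C3 A9 bytes passed through to the terminal unchanged.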

The following does what you want:

#!/usr/bin/env perl

use strict;
use warnings;

use utf8;

BEGIN {
   binmode( STDIN,  ":encoding(UTF-8)" );  # Well, not needed here.
   binmode( STDOUT, ":encoding(UTF-8)" );
   binmode( STDERR, ":encoding(UTF-8)" );
}

print "préféré\n";  

open( my $fh, '<:encoding(UTF-8)', 'text' ) or die $!;

while ( <$fh> ) { print $_ }

close $fh;

But the open pragma makes it a lot cleaner. The following does what you want:

#!/usr/bin/env perl

use strict;
use warnings;

use utf8;
use open ":std", ":encoding(UTF-8)";

print "préféré\n";  

open( my $fh, '<', 'text' ) or die $!;

while ( <$fh> ) { print $_ }

close $fh;
ikegami

UTF-8 is an interesting problem. First, your string literal prints correctly only because you don't do any UTF-8 handling. The literal is a UTF-8 byte string, but Perl doesn't know that it is UTF-8, so it prints the bytes as-is.

So on a UTF-8 terminal everything looks fine, even though it isn't.

When you add `use utf8;` to your source code, you will see that your print now produces the same garbage. But if your source contains UTF-8 string literals, that's what you should do.

use utf8;

# Now also prints garbage
print "préféré\n";

open my $fh, '<:encoding(UTF-8)', 'text';
while ( <$fh> ) {
    print $_;
}
close $fh;

Next, every input that comes from the outside needs to be decoded, and every output needs to be encoded.

use utf8;
use Encode qw(encode decode);

# Now correct
print encode("UTF-8", "préféré\n");

open my $fh, '<:encoding(UTF-8)', 'text';
while ( <$fh> ) {
    print encode("UTF-8", $_);
}
close $fh;

This can be tedious, but you can enable automatic encoding on a filehandle with `binmode`:

use utf8;

# Activate UTF-8 Encode on STDOUT
binmode STDOUT, ':utf8';

print "préféré\n";

open my $fh, '<:encoding(UTF-8)', 'text';
while ( <$fh> ) { 
    print $_;
}
close $fh;

Now everything is UTF-8! You can also activate the layer on STDERR. Remember that if you want to print binary data to STDOUT (for whatever reason), you must disable the layer first:

binmode STDOUT, ':raw';
David Raab