
I have a multiline $string variable that contains UTF-8 csv. I open this string as a file for processing and print its contents.

open(my $fh, "<", \$string) or die $!;
$/ = undef;    # slurp mode: read everything in one go
say <$fh>;

With hexdump I see the text is UTF-8 (É is c3 89).

Now I read the string through Text::CSV.

use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
while (my $line = $csv->getline($fh)) {
    $csv->say(\*STDOUT, $line);
}

The É char has become c9 (the Unicode code point?). If I print that to my console I get � instead of É.

I use perl 5.28.0.

Why is Text::CSV altering the encoding, and how can I avoid it?

EDIT

I've made progress, thanks to @Gilles Quénot and @ikegami, and some trial and error.

What happened is that Text::CSV decoded my strings into perl's internal format. Strings in perl's internal format won't be output correctly to my UTF-8 terminal unless I add use open ':std', ':encoding(UTF-8)';. This directive is apparently needed in my program's main file only.

Another problem I had (absent from my example) was that I needed use utf8 in all source files to decode my program's literals into perl's internal format. Without it, comparisons such as "É" eq $some_var fail because the former stays as raw UTF-8 bytes (my editor saves in that format) while the latter is in perl's internal format.

Another problem I encountered was stacked encoding. Once use open ':std', ':encoding(UTF-8)'; is in place, any other encoding step must be removed from the program (the symptom I had: chars output as 4 bytes instead of 2).

EDIT 2

Here are simple tests that really helped me understand.

# no conversion to internal perl string format
$ perl -M'5.28.0' -e 'say "É"' | hexdump -C
00000000  c3 89 0a                                          |...|
00000003

# string literals converted to perl string format,
# but no conversion of output to terminal
# results in �
$ perl -Mutf8 -M'5.28.0' -e 'say "É"' | hexdump -C
00000000  c9 0a                                             |..|
00000002

# string literals converted to perl string format,
# AND conversion of output
$ perl -Mutf8 -M'open ":std", ":encoding(UTF-8)"' -M'5.28.0' -e 'say "É"' |hexdump -C
00000000  c3 89 0a                                          |...|
00000003

And finally

# entirely transparent because input is decoded 
# and reencoded on output
# use utf8 has no effect in this very basic example
$ echo É | perl -Mutf8 -M'open ":std", ":encoding(UTF-8)"' -M'5.28.0' -pne '' |hexdump -C
00000000  c3 89 0a                                          |...|
00000003

We have to assume strings are converted to perl's internal format at some point, and must be encoded back on output.

Philippe A.

2 Answers


Try adding these lines after the shebang:

# Tell Perl your code is encoded using UTF-8.
use utf8;

# Tell Perl input and output is encoded using UTF-8.
use open ':std', ':encoding(UTF-8)';

See:

- https://stackoverflow.com/a/15147306/465183
- https://perldoc.perl.org/feature#The-'unicode_strings'-feature
- Why does modern Perl avoid UTF-8 by default?

Gilles Quénot

Text::CSV's decode_utf8 option, which is true by default, causes the input to be decoded. This is good. The bug is that you forgot to encode your output.

In this case, this can be achieved using the following (assuming a UTF-8 terminal):

use open ":std", ":encoding(UTF-8)";
ikegami