
I have a multiline $string variable that contains UTF-8 csv. I open this string as a file for processing and print its contents.

open(my $fh, "<", \$string) or die $!;
$/ = undef;    # slurp mode: read everything in one go
say <$fh>;

With hexdump I see the text is UTF-8 (É is c3 89).

Now I read the string through Text::CSV.

use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
while (my $line = $csv->getline($fh)) {
    $csv->say(\*STDOUT, $line);
}

The É char has become c9 (the Unicode code point?). If I print that to my console I get � instead of É.

I use perl 5.28.0.

Why is Text::CSV altering the encoding, and how can I avoid it?

EDIT

I've made progress, thanks to @Gilles Quénot and @ikegami, and some trial and error.

What happened is that Text::CSV decoded my strings into perl's internal format. Strings in perl's internal format won't be output correctly to my UTF-8 terminal unless I add use open ':std', ':encoding(UTF-8)';. This directive is apparently needed in my program's main file only.

Another problem I had (absent from my example) was that I needed use utf8 in all source files to decode my program's literals into perl's internal format. Without it, comparisons such as "É" eq $some_var fail because the former stays as raw UTF-8 bytes (my editor saves in that format) while the latter is in perl's internal format.

Another problem I encountered was stacked encoding. Once use open ':std', ':encoding(UTF-8)'; is in place, any other encoding step must be removed from the program (the symptom I had: chars output as 4 bytes instead of 2).

EDIT 2

Here are simple tests that really helped me understand.

# no conversion to internal perl string format
$ perl -M'5.28.0' -e 'say "É"' | hexdump -C
00000000  c3 89 0a                                          |...|
00000003

# string literals converted to perl string format,
# but no conversion of output to terminal
# results in �
$ perl -Mutf8 -M'5.28.0' -e 'say "É"' | hexdump -C
00000000  c9 0a                                             |..|
00000002

# string literals converted to perl string format,
# AND conversion of output
$ perl -Mutf8 -M'open ":std", ":encoding(UTF-8)"' -M'5.28.0' -e 'say "É"' |hexdump -C
00000000  c3 89 0a                                          |...|
00000003

And finally

# entirely transparent because input is decoded 
# and reencoded on output
# use utf8 has no effect in this very basic example
$ echo É | perl -Mutf8 -M'open ":std", ":encoding(UTF-8)"' -M'5.28.0' -pne '' |hexdump -C
00000000  c3 89 0a                                          |...|
00000003

We have to assume strings are converted to perl's internal format at some point, and must be encoded back on output.

Philippe A.

2 Answers


Try adding these lines after the shebang:

# Tell Perl your code is encoded using UTF-8.
use utf8;

# Tell Perl input and output is encoded using UTF-8.
use open ':std', ':encoding(UTF-8)';

See:

- https://stackoverflow.com/a/15147306/465183
- https://perldoc.perl.org/feature#The-'unicode_strings'-feature
- Why does modern Perl avoid UTF-8 by default?

Gilles Quénot

Text::CSV's decode_utf8 option, which is true by default, causes the input to be decoded. This is good. The bug is that you forgot to encode your output.

In this case, this can be achieved using the following (assuming a UTF-8 terminal):

use open ":std", ":encoding(UTF-8)";
ikegami