To handle UTF-8 on the command line using Perl, we must consider STDIN, STDOUT, STDERR, the arguments, and the source code (given as an argument to the -e or -E option).
Consider the following test case:
echo -n "одобрение за" | perl -Mstrict -w -E '
while (<STDIN>){ s/\X\K/ /g; say; }
say "Arguments and their length:";
say " $_\t", length("$_") foreach @ARGV;
say "Length of in the source code is ", length("");
' a
This is a good test case because it has UTF-8 encoded characters in 3 places:
- on STDIN,
- as arguments, and
- in the source code (provided as an argument to the -E option).
(BTW, my terminal is in a UTF-8 locale.)
Result:
� � � � � � � � � � � � � � � � � � � � � �
Arguments and their length:
a 1
4
Length of in the source code is 4
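First, let's get rid of the question marks. They appear because perl read STDIN as raw bytes, so s/\X\K/ /g inserted a space after every byte and split each two-byte Cyrillic character in half. Let's inform perl that the standard streams are UTF-8 encoded. To do this, add -CSD (S covers STDIN, STDOUT, and STDERR; D makes UTF-8 the default layer for other filehandles the program opens):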
echo -n "одобрение за" | perl -Mstrict -w -CSD -E '
while (<STDIN>){ s/\X\K/ /g; say; }
say "Arguments and their length:";
say " $_\t", length("$_") foreach @ARGV;
say "Length of in the source code is ", length("");
' a
Note: I could have simply used -C, because -C on its own implies -CSDL which, on a system in a UTF-8 locale, is the same as -CSD, as explained in perlrun.
Result:
о д о б р е н и е з а
Arguments and their length:
a 1
ð 4
Length of ð in the source code is 4
Good, that got rid of the question marks.
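But now the emoji in the arguments and in the source code is messed up: perl still sees each one as four separate bytes and, with STDOUT now a UTF-8 stream, encodes every byte as a character of its own.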
We must inform perl that our arguments are UTF-8. We do this by changing -CSD to -CSDA:
echo -n "одобрение за" | perl -Mstrict -w -CSDA -E '
while (<STDIN>){ s/\X\K/ /g; say; }
say "Arguments and their length:";
say " $_\t", length("$_") foreach @ARGV;
say "Length of in the source code is ", length("");
' a
Result:
о д о б р е н и е з а
Arguments and their length:
a 1
1
Length of ð in the source code is 4
Good. The emoji argument is fixed and its length is 1 character, as expected.
The emoji in the source code is still problematic.
To inform perl that the source code is encoded as UTF-8, add use utf8; to the source code or -Mutf8 to the command line options:
echo -n "одобрение за" | perl -Mutf8 -Mstrict -w -CSDA -E '
while (<STDIN>){ s/\X\K/ /g; say; }
say "Arguments and their length:";
say " $_\t", length("$_") foreach @ARGV;
say "Length of in the source code is ", length("");
' a
Result:
о д о б р е н и е з а
Arguments and their length:
a 1
1
Length of in the source code is 1
Good, now we get the expected result for the emoji character located in the source code.
Summary:
- Add -CSD to inform perl that the standard streams are UTF-8 encoded.
- Change that to -CSDA to handle UTF-8 encoded arguments too.
- Add use utf8; to the source code, or add -Mutf8 to the options, to inform perl that the source code is UTF-8 encoded.