
How can I handle utf8 using Perl (or Python) on the command line?

I am trying to split the characters in each word, for example. This is very easy for non-utf8 text, for example:

$ echo "abc def" | perl -ne 'my @letters = m/(.)/g; print "@letters\n"' | less
a b c   d e f

But with utf8 it doesn't work, of course:

$ echo "одобрение за" | perl -ne 'my @letters = m/(.)/g; print "@letters\n"' | less
<D0> <BE> <D0> <B4> <D0> <BE> <D0> <B1> <D1> <80> <D0> <B5> <D0> <BD> <D0> <B8> <D0> <B5>   <D0> <B7> <D0> <B0>

because Perl reads the input as bytes by default and doesn't know that each of these characters is encoded as two bytes.

It would also be good to know how this (i.e., command-line processing of utf8) is done in Python.

Frank
  • `$ sed 's/./& /g' <<< "одобрение за"` `о д о б р е н и е з а ` – Ignacio Vazquez-Abrams Mar 16 '12 at 02:02
  • @Ignacio Vazquez-Abrams: `sed 's/./& /g'` doesn't work for graphemes (it matters if a text contains combined characters, for example, `"Солжени́цын"`). In Perl, Python it can be solved using `/\X/` regex. – jfs Mar 16 '12 at 03:02

5 Answers


The "-C" flag controls some of the Perl Unicode features (see perldoc perlrun):

$ echo "одобрение за" | perl -C -pe 's/.\K/ /g'
о д о б р е н и е   з а 

In Python, you can specify the encoding used for stdin/stdout with the PYTHONIOENCODING environment variable (the example below is Python 2):

$ echo "одобрение за" | PYTHONIOENCODING=utf-8 python -c'import sys
for line in sys.stdin:
    print " ".join(line.decode(sys.stdin.encoding)),
'
о д о б р е н и е   з а 

If you'd like to split the text on grapheme (user-perceived character) boundaries, not on code points as the code above does, then you could use the /\X/ regular expression:

$ echo "одобрение за" | perl -C -pe 's/\X\K/ /g'
о д о б р е н и е   з а 

See Grapheme Cluster Boundaries

In Python, \X is supported by the third-party regex module.
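To make the difference between code points and graphemes concrete, here is a rough Python 3 sketch that uses only the standard library. It is an approximation, not what \X does: it merely attaches combining marks to the preceding character, and the helper name split_clusters is made up for this illustration.

```python
import unicodedata


def split_clusters(text):
    # Approximation only: glue each combining mark onto the character
    # before it. The real \X follows the full Unicode segmentation rules.
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "Солжени\u0301цын"  # U+0301 is a combining acute accent
print(len(word))                   # 11 code points
print(len(split_clusters(word)))   # 10 approximate clusters
print(" ".join(split_clusters(word)))
```

Splitting on code points would separate the accent from the letter it belongs to; grouping keeps them together.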

jfs
  • @Frank: [`\K keeps the stuff left of it`](http://perldoc.perl.org/perlre.html#(%3f%3c%3dpattern)-%5cK) – jfs Mar 19 '12 at 07:16

"Hey", I thought, "how difficult could this be in Perl?"

Turns out it's pretty easy. Unfortunately, finding out how took me longer than I thought.

A quick glance at use utf8 showed me that it isn't the answer here: it only tells Perl that the source code itself is UTF-8. Perl's binmode looked promising, but not quite.

I found there's a perluniintro, which led me to perlunicode, which said I should look at perlrun. There, I found what I was looking for.

Perl has a command-line switch, -C, that enables its Unicode features. The switch takes options that specify what should be treated as Unicode, and perlrun has a convenient chart of them. It would appear that perl -C by itself would be fine: it is equivalent to -CSDL, or -C95. However, the L makes the other options conditional on the locale environment variables, so if your locale isn't set to a UTF-8 one, Perl won't treat the streams as UTF-8.

Instead, you should use perl -CSD or perl -C31.

$ echo "одобрение за" | perl -CSD -ne 'my @letters = m/(.)/g; print "@letters\n"'
о д о б р е н и е   з а

Yup, that works.
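The letter options combine as bit flags, so the numeric forms can be checked with a little arithmetic. Below is a small Python 3 sketch of that calculation, assuming the per-letter values listed in the perlrun chart; the names C_FLAGS and c_option_value are made up for this illustration.

```python
# Values of the -C option letters, as listed in perldoc perlrun.
C_FLAGS = {
    "I": 1, "O": 2, "E": 4, "S": 7,   # S = I + O + E (standard streams)
    "i": 8, "o": 16, "D": 24,         # D = i + o (default PerlIO layers)
    "A": 32,                          # @ARGV
    "L": 64,                          # make the above depend on the locale
}


def c_option_value(letters):
    # Hypothetical helper: OR together the flags for a letter list like "SD".
    value = 0
    for letter in letters:
        value |= C_FLAGS[letter]
    return value


for letters in ("SD", "SDA", "SDL"):
    print("-C%s == -C%d" % (letters, c_option_value(letters)))
# -CSD == -C31
# -CSDA == -C63
# -CSDL == -C95
```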

You can learn quite a bit just answering a question.

David W.
  • +1: you might mean `-CSDA` (to process `@ARGV`), though from the OP the locale can be assumed `utf-8`-based so a mere `-C` is enough. – jfs Mar 16 '12 at 03:27
  • use utf8 isn't exactly obsolete, it's just that it has only the limited purpose of telling perl your source code is in utf8. You need to do other things to ingest and eject data in utf8. – Alex Mar 16 '12 at 04:08
  • Well, the utf8 pragma started off much more ambitiously than it ended up. It was conceived as something that would be more like utf8::all. – brian d foy Mar 16 '12 at 09:13
  • @jfs `-C` is not sufficient to process `@ARGV` as UTF-8 encoded strings, even in a UTF-8 locale, because [`-C`](https://perldoc.perl.org/perlrun#-C-%5Bnumber/list%5D) implies `-CSDL`. It doesn't imply the `A`. This is demonstrated in my answer. – Robin A. Meade Mar 12 '22 at 00:27
  • @RobinA.Meade: yes, `-C` may be not enough in some cases. From your answer `-C` vs. `-CA`: `perl -C -E 'say length("$_") foreach @ARGV' ''` -> `4`. While `perl -CA -E 'say length("$_") foreach @ARGV' ''` -> `1`. – jfs Mar 12 '22 at 07:40

I don't know Perl, so I'm answering for Python.

Python 2 doesn't know that the input text is UTF-8; it reads it as plain bytes. You need to explicitly decode it from UTF-8, or whatever encoding it actually is, into Unicode. Then you can use the normal Python text processing tools on it.

http://docs.python.org/howto/unicode.html

Here's a simple Python 2.x program for you to try:

import sys

for line in sys.stdin:
    u_line = unicode(line, encoding="utf-8")
    for ch in u_line:
        print ch, # print each character with a space after

This reads lines from the standard input and converts each line to Unicode, with the encoding specified as UTF-8. Then for ch in u_line sets ch to each character in turn. print ch, (with the trailing comma) is the easy way in Python 2.x to print a character followed by a space, with no newline.

I still use Python 2.x for most of my work, but for Unicode I would recommend you use Python 3.x. The Unicode stuff is really improved.

Here is the Python 3 version of the above program, tested on my Linux computer.

import sys

assert sys.stdin.encoding.lower() == 'utf-8'  # newer Python versions report 'utf-8' in lowercase
for line in sys.stdin:
    for ch in line:
        print(ch, end=' ') # print each character with a space after

By default, Python 3 decodes the standard input using the locale's preferred encoding, which is UTF-8 on my system, so the lines arrive as Unicode strings. Python 3 strings are always Unicode; there is a separate type, bytes, for string-like objects that hold raw byte values. This is the opposite of Python 2.x, where the basic string type was a string of bytes and a Unicode string was a special new thing.

Of course it isn't necessary to assert that the encoding is UTF-8, but it's a nice simple way to document our intentions and make sure that the default didn't get changed somehow.
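If you would rather not depend on the locale at all, one option is to decode the bytes as UTF-8 explicitly. The following is a minimal Python 3 sketch of that idea; the helper name spaced_lines is made up for this illustration, and an in-memory byte stream stands in for the real input.

```python
import io


def spaced_lines(byte_stream):
    # Decode the binary stream as UTF-8 regardless of the locale, then
    # yield each line with a space between its characters.
    text_stream = io.TextIOWrapper(byte_stream, encoding="utf-8")
    for line in text_stream:
        yield " ".join(line.rstrip("\n"))


# Stand-in for piped input; a real script could pass sys.stdin.buffer.
sample = io.BytesIO("одобрение за\n".encode("utf-8"))
for spaced in spaced_lines(sample):
    print(spaced)
```

In a real script, sys.stdin.buffer is the raw byte stream underneath sys.stdin. On Python 3.7 and later, sys.stdin.reconfigure(encoding="utf-8") is another way to get the same effect.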

In Python 3, print() is now a function. And instead of that somewhat strange syntax of appending a comma after a print statement to make it print a space instead of a newline, there is now a named keyword argument that lets you change the end char.

NOTE: Originally I had a bare print statement after handling the input line in the Python 2.x program, and print() in the Python 3.x program. As J.F. Sebastian pointed out, the code is printing characters from the input line, and the last character will be a newline, so there really isn't a need for the additional print statement.

steveha
  • Python 3.x unicode stuff really didn't change much. Only the default encoding, and the literals in the code itself, have changed. Also some stuff has been renamed. No new functionality has been added in this regard. – nosklo Mar 16 '12 at 02:30
  • @nosklo, as my second example shows, the defaults are now Unicode-aware in Python 3.x. There is no need to explicitly convert the input string to a Unicode string; you can just process it. That's a pretty important change IMHO. – steveha Mar 16 '12 at 02:47
  • there is already a newline; you don't need a bare `print` statement i.e., `print "\n",` prints the newline by itself. – jfs Mar 16 '12 at 02:57
  • @J.F. Sebastian, when you're right, you're right. I'll shorten the examples. – steveha Mar 16 '12 at 03:26
  • also: 1. `print unicode_string` doesn't work as expected even if `sys.stdout.encoding` is correct ([bug in Python](http://bugs.python.org/issue4947)) so you need either encode to bytes before printing or use `PYTHONIOENCODING` 2. `sys.stdin.encoding` depends on locale so the assertion may fail. – jfs Mar 16 '12 at 03:33
$ echo "одобрение за" | python -c 'import sys, codecs ; x = codecs.getreader("utf-8")(sys.stdin); print u", ".join(x.read().strip())'
о, д, о, б, р, е, н, и, е,  , з, а

or if you want unicode codepoints:

$ echo "одобрение за" | python -c 'import sys, codecs ; x = codecs.getreader("utf-8")(sys.stdin); print u", ".join("<%04x>" % ord(ch) for ch in x.read().strip())'
<043e>, <0434>, <043e>, <0431>, <0440>, <0435>, <043d>, <0438>, <0435>, <0020>, <0437>, <0430>
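
The one-liners above are Python 2. For comparison, here is a minimal Python 3 sketch of the same code point listing. In Python 3 a str is already a sequence of code points, so no codecs reader is needed once the text is decoded; the function name codepoints is made up for this illustration.

```python
def codepoints(text):
    # Format every character as its code point in <xxxx> form.
    return ", ".join("<%04x>" % ord(ch) for ch in text)


print(codepoints("одобрение за"))
```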
nosklo

To handle UTF-8 on the command line using Perl, we must consider STDIN, STDOUT, STDERR, the arguments, and the source code (given as an argument to the -e or -E option).

Consider the following test case:

echo -n "одобрение за"  | perl -Mstrict -w -E '
  while (<STDIN>){ s/\X\K/ /g; say; }
  say "Arguments and their length:";
  say "  $_\t", length("$_") foreach @ARGV;
  say "Length of  in the source code is ", length("");
' a 

This is a good test case because it has UTF-8 encoded characters in 3 places:

  1. on STDIN,
  2. as arguments, and
  3. in the source code (provided as an argument to the -E option).

(BTW, my terminal is in a UTF-8 locale.)

Result:

� � � � � � � � � � � � � � � � � �   � � � � 
Arguments and their length:
 a  1
  4
Length of  in the source code is 4

First, let's get rid of the question marks. Let's inform perl that the standard streams are UTF-8 encoded characters. To do this, add -CSD:

echo -n "одобрение за"  | perl -Mstrict -w -CSD -E '
  while (<STDIN>){ s/\X\K/ /g; say; }
  say "Arguments and their length:";
  say "  $_\t", length("$_") foreach @ARGV;
  say "Length of  in the source code is ", length("");
' a 

Note: I could have simply used -C because -C implies -CSDL which, on a system in a UTF-8 locale, is the same as -CSD, as explained at perlrun.

Result:

о д о б р е н и е   з а 
Arguments and their length:
  a 1
  ð 4
Length of ð in the source code is 4

Good, that got rid of the question marks.

But now the emoji in the arguments and in the source code is messed up.

We must inform perl that our arguments are UTF-8. We do this by changing -CSD to -CSDA:

echo -n "одобрение за"  | perl -Mstrict -w -CSDA -E '
  while (<STDIN>){ s/\X\K/ /g; say; }
  say "Arguments and their length:";
  say "  $_\t", length("$_") foreach @ARGV;
  say "Length of  in the source code is ", length("");
' a 

Result:

о д о б р е н и е   з а 
Arguments and their length:
 a  1
  1
Length of ð in the source code is 4

Good. The emoji argument is fixed and its length is 1 character, as expected.

The emoji in the source code is still problematic.

To inform perl that the source code is encoded as UTF-8, add use utf8; to the source code or -Mutf8 to the command line options:

echo -n "одобрение за"  | perl -Mutf8 -Mstrict -w -CSDA -E '
  while (<STDIN>){ s/\X\K/ /g; say; }
  say "Arguments and their length:";
  say "  $_\t", length("$_") foreach @ARGV;
  say "Length of  in the source code is ", length("");
' a 

Result:

о д о б р е н и е   з а 
Arguments and their length:
 a  1
  1
Length of  in the source code is 1

Good, now we get the expected result for the emoji character located in the source code.

Summary:

  • Add -CSD to inform perl that the standard streams are UTF-8 encoded.
  • Change that to -CSDA to handle UTF-8 encoded arguments too.
  • Add use utf8; to the source code or add -Mutf8 to the options to inform perl that the source code is UTF-8 encoded.
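
The lengths 4 and 1 seen above are just byte count versus character count. Here is a small Python 3 sketch of the same effect, using U+1F600 as a stand-in (my assumption; any character that takes four bytes in UTF-8 behaves the same way):

```python
emoji = "\U0001F600"         # stand-in example character (an assumption)
raw = emoji.encode("utf-8")  # the bytes a UTF-8 terminal actually sends

print(len(raw))                   # 4: counted as undecoded bytes
print(len(raw.decode("utf-8")))   # 1: counted as a decoded character

# Treating each undecoded byte as a Latin-1 character gives four
# characters, and the first one (byte 0xF0) is "ð".
misread = raw.decode("latin-1")
print(len(misread))               # 4
print(misread[0])                 # ð
```

This is consistent with the ð that shows up in the results when the emoji has not been decoded.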
Robin A. Meade