Perl UTF-8 encoding on DATA and ARGV file handles

Question

I have some text files with a lot of Unicode Hebrew and Greek in them which need to be enclosed within an HTML <span class ="hebrew">...</span> element. These files belong to a project which has been running for some years.

Around eight years ago we successfully used this Perl script to do the job.

#!/usr/bin/perl

use utf8;

my $table = [
  {
    FROM  => "\\x{0590}",
    TO    => "\\x{05ff}",
    REGEX => "[\\x{0590}-\\x{05ff}]",
    OPEN  => "<span class =\"hebrew\">",
    CLOSE => "</span>",
  },
  {
    FROM  => "\\x{0370}",
    TO    => "\\x{03E1}",
    REGEX => "[\\x{0370}-\\x{03E1}]|[\\x{1F00}-\\x{1FFF}]",
    OPEN  => "<span class =\"greek\">",
    CLOSE => "</span>",
  },
];

binmode(STDIN,":utf8");
binmode(STDIN,"encoding(utf8)");

binmode(STDOUT,":utf8");
binmode(STDOUT,"encoding(utf8)");

while (<>) {

  my $line = $_;

  foreach my $l (@$table) {

    my $regex          = $l->{REGEX};
    my ($from, $to)    = ($l->{FROM},$l->{TO});
    my ($open, $close) = ($l->{OPEN},$l->{CLOSE});

    $line =~ s/(($regex)+(\s+($regex)+)*)/$open$1$close/g;
  }

  print $line;
}

That scans the text file looking for the defined Unicode ranges, and inserts the appropriate span wrapper.

I haven't used this script for some time, and I now need to process some more text files. But somehow the Unicode is not being preserved: the Unicode text is being corrupted instead of being wrapped in <span> tags.

I need help with a fix before I can proceed.

Here's some sample input:

Mary had a little כֶּבֶשׂ, its fleece was white as χιών. And πάντα that Mary went, the כֶּבֶשׂ was sure to go.

And here's what I'm getting as output:

Mary had a little ×Ö¼Ö¶×Ö¶×©×, its fleece was white as ÏÎ¹ÏÎ½. And ÏÎ¬Î½ÏÎ± that Mary went, the ×Ö¼Ö¶×Ö¶×©× was sure to go.

Just at the moment I'm on a machine with Linux Mint 13 LTS. My other OS is Ubuntu 14.04. The Perl version is reported as v5.14.2. I'm running the script like this:

perl uconv.pl infile.txt > outfile.txt

I'm not sure what's happening, and in spite of looking at quite a few Stack Overflow questions and answers (this one for example), I'm none the wiser. Perhaps I need to set some environment variable? Or is something in that script now deprecated? Or...?

`[\x{0590}-\x{05ff}]` is better written `\p{InHebrew}`. Likewise for `[\x{0370}-\x{03E1}]`: the closest property to the Greek characters is `\p{InGreek}`, which includes Coptic characters and extends to U+03FF. — Borodin, Aug 24 '14 at 16:24
@Borodin Is there a specific resource that you would recommend to look up the proper character class for those unicode ranges? — Miller, Aug 24 '14 at 19:13
@Miller: If you just Google for, say `U+263A` then the first option will be the relevant page on `FileFormat.info`, whose [Unicode section](http://www.fileformat.info/info/unicode/index.htm) is full of useful stuff. There's also [Charset tool](http://www.toolcase.org/charset/index.php), which has some very useful tools but is partially in German so you may want to use Google's translation facility on Chrome. Then of course there's [`perluniprops`](http://perldoc.perl.org/perluniprops.html) which lists the names that Perl expects. You can test what a property matches with a `0 .. 0xFFFF` loop. — Borodin, Aug 24 '14 at 19:25
@Davïd: I hope you're happy with my amendment of your question. I intended that my representation would help those searching for the same *solution* to find it, while those with a similar *problem* would be more likely to pass it over if the contents were irrelevant. — Borodin, Aug 24 '14 at 23:26
@Borodin - it's all good. Many thanks! You've made the web a better place. :) — Dɑvïd, Aug 25 '14 at 06:43
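
The `0 .. 0xFFFF` loop suggested in the comments above could look roughly like this. It is only a sketch: the helper name `property_range` is invented for illustration, and surrogate code points are skipped on the assumption that they are of no interest here.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first and last BMP code points whose
# character matches the pattern, plus the total number of matches.
sub property_range {
    my ($re) = @_;
    my @hits = grep { chr($_) =~ $re }
               grep { $_ < 0xD800 || $_ > 0xDFFF }    # skip surrogates
               0 .. 0xFFFF;
    return @hits ? ($hits[0], $hits[-1], scalar @hits) : ();
}

# The two block properties named in the comment above.
my %patterns = (
    InHebrew => qr/\p{InHebrew}/,
    InGreek  => qr/\p{InGreek}/,
);

for my $name (sort keys %patterns) {
    my ($first, $last, $count) = property_range($patterns{$name});
    printf "%-9s U+%04X .. U+%04X (%d code points)\n",
        $name, $first, $last, $count;
}
```

Because these are block properties, the loop reports every code point in the block, including ones that are currently unassigned.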

Borodin · Accepted Answer · 2014-08-24T22:31:07.237

Your output is fine. Perl is printing the correct byte sequences for the UTF-8-encoded string.

For instance, the first Hebrew word כֶּבֶשׂ contains these seven Unicode characters

05DB   05BC   05B6   05D1   05B6   05E9   05C2
kaf    dagesh segol  bet    segol  shin   sin dot

which are encoded in UTF-8 as the fourteen bytes (two per character)

[D7 9B] [D6 BC] [D6 B6] [D7 91] [D6 B6] [D7 A9] [D7 82]

and that is the contents of the malformed string that you show.
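
The code points and bytes listed above can be checked with the core `Encode` and `charnames` modules. This is an illustrative sketch only; the helper name `utf8_hex` is made up here, and the word is written with `\x{...}` escapes so the example does not depend on the encoding of the source file.

```perl
use strict;
use warnings;
use charnames ();
use Encode qw(encode);

# The seven characters of the sample Hebrew word.
my $word = "\x{05DB}\x{05BC}\x{05B6}\x{05D1}\x{05B6}\x{05E9}\x{05C2}";

# Hypothetical helper: UTF-8 encode a character string and return the
# resulting octets as space-separated hex pairs.
sub utf8_hex {
    my ($chars) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', encode('UTF-8', $chars);
}

# One line per character: code point, official Unicode name, UTF-8 octets.
for my $char (split //, $word) {
    printf "U+%04X  %-30s %s\n",
        ord $char, charnames::viacode(ord $char), utf8_hex($char);
}

printf "%d characters, %d octets\n", length $word, length encode('UTF-8', $word);
```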

The problem isn't that the program is printing the wrong characters, but that whatever you are using to examine the output isn't expecting UTF-8.

Update

It looks like the problem is with ARGV, not STDIN. Reading from the null file handle actually reads from ARGV, so setting a UTF-8 Perl IO layer on STDIN with binmode, as you have done, has no effect. Also, you can't set the mode of ARGV in the same way because it's not yet open.

But you can fix this by using

use open qw/ :std :encoding(utf8) /;

which specifies the default layers to be applied to newly open input (and output) handles, including ARGV. So when it is opened automatically on the first execution of <> your data should be read properly.
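
A self-contained sketch of the effect, under a few assumptions: a temporary file stands in for `infile.txt`, `@ARGV` is set by hand instead of coming from the command line, and the sub name `wrap_scripts` is invented for illustration. It also uses the strict `:encoding(UTF-8)` spelling rather than `:encoding(utf8)`; either should behave the same for well-formed input.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Default layers for handles opened in this lexical scope, which covers ARGV
# when <> opens it; :std also applies them to STDIN, STDOUT and STDERR.
use open qw/ :std :encoding(UTF-8) /;

# Same ranges and span markup as the table in the question.
my @table = (
    { REGEX => qr/[\x{0590}-\x{05ff}]/,
      OPEN  => '<span class ="hebrew">', CLOSE => '</span>' },
    { REGEX => qr/[\x{0370}-\x{03E1}]|[\x{1F00}-\x{1FFF}]/,
      OPEN  => '<span class ="greek">',  CLOSE => '</span>' },
);

# Wrap each run of matching characters (words separated only by whitespace)
# in the corresponding span element.
sub wrap_scripts {
    my ($line) = @_;
    for my $entry (@table) {
        my ($re, $open, $close) = @{$entry}{qw/ REGEX OPEN CLOSE /};
        $line =~ s/((?:$re)+(?:\s+(?:$re)+)*)/$open$1$close/g;
    }
    return $line;
}

# Stand-in for infile.txt: a temporary file holding UTF-8 text.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':encoding(UTF-8)';
print {$tmp} "a little \x{05DB}\x{05BC}\x{05B6}\x{05D1}\x{05B6}\x{05E9}\x{05C2},",
             " white as \x{03C7}\x{03B9}\x{03CE}\x{03BD}.\n";
close $tmp or die "Cannot close $path: $!";

# Pretend the file was named on the command line, then read it through <>.
@ARGV = ($path);
my @wrapped;
while (my $line = <>) {
    push @wrapped, wrap_scripts($line);
}
print @wrapped;
```

In a real script the temporary file and the `@ARGV` assignment would go away, leaving the `use open` line, the table, and the `while (<>)` loop.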

Update

It has also just dawned on me why the output text was wrong.

My wrong thinking was that, even if the input was read as a sequence of octets instead of UTF-8-encoded wide characters, it should still produce the correct result if those same octets were copied, unmodified, to the output.

What is now glaringly obvious is that while the input is in bytes, STDOUT is set to UTF-8 encoding, so the already-encoded data will be reencoded. Let's take this Hebrew word for lamb from above

[D7 9B] [D6 BC] [D6 B6] [D7 91] [D6 B6] [D7 A9] [D7 82]

Because ARGV was still set to :raw, the input was interpreted as these fourteen single-byte characters instead of as seven UTF-8-encoded wide characters

D7 9B D6 BC D6 B6 D7 91 D6 B6 D7 A9 D7 82

Now, if that string is printed then it will be encoded into UTF-8 because that is how STDOUT has been set. ASCII (seven-bit) characters would survive UTF-8 encoding untouched, but all of the “characters” in this string are at code point 0x80 or higher, so they will be encoded as multi-byte characters.

The result of encoding those fourteen “characters” is this series of twenty-eight octets

[C3 97] [C2 9B] [C3 96] [C2 BC] [C3 96] [C2 B6] [C3 97] [C2 91] [C3 96] [C2 B6] [C3 97] [C2 A9] [C3 97] [C2 82]

which, when displayed as a UTF-8-encoded string, will appear as the fourteen nonsense “characters” that were the result of reading from ARGV without decoding.

Erm, QED I think.
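
The double encoding described above can be reproduced without any files, as a rough sketch. In-memory file handles stand in for `ARGV` and `STDOUT`, and the helper name `hex_octets` is invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The sample word as it sits on disk: fourteen UTF-8 octets.
my $octets = encode('UTF-8',
    "\x{05DB}\x{05BC}\x{05B6}\x{05D1}\x{05B6}\x{05E9}\x{05C2}");

# Read it back through a handle with no decoding layer, as ARGV was read.
open my $in, '<', \$octets or die "Cannot open in-memory input: $!";
my $line = <$in>;
close $in;

# Print it through a UTF-8 encoding layer, as STDOUT was configured.
my $written = '';
open my $out, '>:encoding(UTF-8)', \$written
    or die "Cannot open in-memory output: $!";
print {$out} $line;
close $out or die "Cannot close in-memory output: $!";

# Hypothetical helper: format a byte string as space-separated hex pairs.
sub hex_octets {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

printf "read:    %d single-byte characters\n", length $line;
printf "written: %d octets\n", length $written;
print hex_octets($written), "\n";
```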

That's a help, certainly - although I'm still puzzled, as any text editor I use (with UTF-8 as encoding) doesn't represent the Unicode characters ... nor are the `<span>` tags being added. Any thoughts? — Dɑvïd, Aug 24 '14 at 17:06
The tags are working fine for me. If they're not being added then it's probably because you're not reading the file as UTF-8, so the wide characters aren't being recognised. Try printing the text right after reading it, before performing any substitution. Also try adding `use open qw/ :std :encoding(utf8) /;` to the top of your program. — Borodin, Aug 24 '14 at 18:15
Yes, `use open` should do it. The problem is that reading from the null file handle using `<>` actually reads from `ARGV`, not `STDIN`. I've just done some experimentation, and setting a UTF-8 layer on `STDIN` (and `STDOUT`) with `binmode` as you have done has no effect on `ARGV`, but if you do it with `use open` instead then the layers propagate to `ARGV` correctly. — Borodin, Aug 24 '14 at 18:26
The `use open...` line did it! Thanks for this, and for the informative explanations, *and* the "value added" on the unicode ranges, too. Invaluable help! — Dɑvïd, Aug 24 '14 at 20:25
@Davïd: You're welcome. I hadn't fully understood myself how `use open ':std'` works (the `DATA` file handle behaves the same way as `ARGV`) so it was useful for me as well. By the way, it is vital to always `use strict` and `use warnings` at the top of every Perl program you write, particularly if you are asking for help with it. It is a simple measure that will save you countless hours of debugging. Also, strictly speaking your `binmode(STDOUT,"encoding(utf8)")` is wrong: the mode should have a leading colon, as in `:encoding(utf8)`. — Borodin, Aug 24 '14 at 21:28
@Davïd: Regarding your edit of my answer. It is part of my posting style that I never put a semicolon at the end of isolated Perl statements. They appear in a Perl program only to separate statements from one another, and otherwise are just needless noise. Even within a Perl program it is usual to prefer `map { s/\A\h+|\h+\z//g } @list` to `map { s/\A\h+|\h+\z//g; } @list`, but I admit that that has more to do with the end of the statement not being the end of the text line. — Borodin, Aug 24 '14 at 21:37
Re semicolon - ah, right. I copied the non-semicolon line and it threw an error, and had to come back to your comment to see the semicolon (which felt like a "doh!" moment). So I thought adding the `;` in the answer, too, might save some other numpty like me a little aggro. Btw - wish I could UV again - lots of help here. Thanks again! — Dɑvïd, Aug 25 '14 at 06:47
