I have some text files containing a lot of Unicode Hebrew and Greek. Each run of Hebrew needs to be enclosed in an HTML <span class ="hebrew">...</span> element, and each run of Greek in the equivalent <span class ="greek">...</span> element. These files belong to a project which has been running for some years.
Around eight years ago we successfully used this Perl script to do the job:
#!/usr/bin/perl
use utf8;
my $table = [
    {
        FROM => "\\x{0590}",
        TO => "\\x{05ff}",
        REGEX => "[\\x{0590}-\\x{05ff}]",
        OPEN => "<span class =\"hebrew\">",
        CLOSE => "</span>",
    },
    {
        FROM => "\\x{0370}",
        TO => "\\x{03E1}",
        REGEX => "[\\x{0370}-\\x{03E1}]|[\\x{1F00}-\\x{1FFF}]",
        OPEN => "<span class =\"greek\">",
        CLOSE => "</span>",
    },
];
binmode(STDIN,":utf8");
binmode(STDIN,"encoding(utf8)");
binmode(STDOUT,":utf8");
binmode(STDOUT,"encoding(utf8)");
while (<>) {
    my $line = $_;
    foreach my $l (@$table) {
        my $regex = $l->{REGEX};
        my ($from, $to) = ($l->{FROM},$l->{TO});
        my ($open, $close) = ($l->{OPEN},$l->{CLOSE});
        # Wrap each run of matching characters (runs may be joined by whitespace).
        $line =~ s/(($regex)+(\s+($regex)+)*)/$open$1$close/g;
    }
    print $line;
}
The script scans each line of the text file for characters in the defined Unicode ranges and wraps every matching run in the appropriate span element.
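To separate the substitution from any file handling, the same wrapping can be tried on a string built from `\x{...}` escapes. This is only a minimal sketch, not the script above: `wrap_scripts`, `@ranges`, and the `CLASS` key are names made up for illustration, and the ranges are copied from the table.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical table for illustration; ranges copied from the script above.
my @ranges = (
    { REGEX => qr/[\x{0590}-\x{05FF}]/,                     CLASS => 'hebrew' },
    { REGEX => qr/[\x{0370}-\x{03E1}]|[\x{1F00}-\x{1FFF}]/, CLASS => 'greek'  },
);

# Hypothetical helper: wrap each run of matching characters (runs may be
# joined by whitespace) in a span carrying the class for that script.
sub wrap_scripts {
    my ($line) = @_;
    for my $r (@ranges) {
        my $re = $r->{REGEX};
        $line =~ s{((?:$re)+(?:\s+(?:$re)+)*)}{<span class ="$r->{CLASS}">$1</span>}g;
    }
    return $line;
}

# The input is already a character string, so no decoding layer is involved.
my $in  = "white as \x{03C7}\x{03B9}\x{03CE}\x{03BD}.";
my $out = wrap_scripts($in);

# Encode on the way out so the Greek characters print as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');
print "$out\n";
```

If this wraps the Greek word as expected, the substitution itself is probably not the part that has stopped working.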
I haven't used this script for some time, and I now need to process some more text files. But the Unicode is no longer being preserved: instead of being wrapped in <span> tags, the Hebrew and Greek text comes out corrupted.
I need help with a fix before I can proceed.
Here's some sample input:
Mary had a little כֶּבֶשׂ, its fleece was white as χιών. And πάντα that Mary went, the כֶּבֶשׂ was sure to go.
And here's what I'm getting as output:
Mary had a little ×Ö¼Ö¶×ֶש×, its fleece was white as ÏιÏν. And ÏάνÏα that Mary went, the ×Ö¼Ö¶×Ö¶×©× was sure to go.
At the moment I'm on a machine running Linux Mint 13 LTS; my other OS is Ubuntu 14.04. Perl reports its version as v5.14.2. I'm running the script like this:
perl uconv.pl infile.txt > outfile.txt
I'm not sure what's happening, and in spite of looking at quite a few Stack Overflow questions and answers (this one for example), I'm none the wiser. Perhaps I need to set some environment variable? Or is something in that script now deprecated? Or...?
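One thing I have not ruled out: the `binmode` calls in the script apply to STDIN, whereas `perl uconv.pl infile.txt` reads the named file through `<>`. Below is a minimal sketch of a check for that, assuming the input files are UTF-8. It opens a file with an explicit decoding layer and counts matches against the Greek range. The temporary file and its contents exist only to make the sketch self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small UTF-8 sample to a temporary file (illustration only).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':encoding(UTF-8)');
print {$tmp} "white as \x{03C7}\x{03B9}\x{03CE}\x{03BD}.\n";
close($tmp) or die "close: $!";

# Open the file with an explicit decoding layer instead of relying on
# binmode(STDIN, ...), which does not touch a file named on the command line.
open(my $in, '<:encoding(UTF-8)', $path) or die "open $path: $!";
my $line = <$in>;
close($in);

# The Greek range from the script; it can match only if $line holds decoded
# characters rather than raw bytes.
my $count = () = $line =~ /[\x{0370}-\x{03E1}]/g;

binmode(STDOUT, ':encoding(UTF-8)');
print "Greek characters matched: $count\n";
print $line;
```

If the count is non-zero here, the same layer could presumably be applied to the real script, for instance with the `open` pragma (`use open qw(:std :encoding(UTF-8));`) or the `-CSD` command-line switch, though I have not verified either against my files.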