I'm modifying a Perl script that reads a series of UCS-2LE encoded files containing tab-delimited strings, but I'm having trouble splitting the strings on the tab character when a string contains characters outside the extended Latin character set.
Here is a sample line that I'm reading in from these files (tab-delimited):
adını transcript asr turkish
When I had the script write these lines to the output file to debug the issue, this is what it wrote:
ad1Ů1ĉtranscript asr turkish
It appears that it's not recognizing the tab character after the Turkish character. This only happens when the word ends with a non-Latin character (and so is adjacent to the tab).
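A minimal sketch that seems to reproduce the garbled output above, assuming the file really is little-endian and is read line by line as raw bytes before decoding. In UTF-16LE a newline is the byte pair 0A 00, so a byte-level read that breaks on 0A leaves the 00 at the start of the next line and shifts every later pair by one byte. The "header" line and the variable names are hypothetical; only the second line comes from the sample.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical two-line file content, encoded as UTF-16LE bytes.
my $bytes = encode('UTF-16LE',
    "header\nad\x{131}n\x{131}\ttranscript\tasr\tturkish\n");

# A byte-level readline breaks on the single byte 0x0A, so the 0x00 half of
# each newline ends up at the start of the following line.
my @raw_lines = split /\x0A/, $bytes, -1;
my $second = $raw_lines[1];

# Drop the dangling byte so the length is even, then decode as big-endian,
# which is what the misaligned bytes now look like.
chop $second if length($second) % 2;
my $decoded = decode('UTF-16BE', $second);

# The bytes 31 01 09 00 (dotless i, tab) are now read as 01 09, i.e. U+0109,
# so the first tab disappears into that character.
my @fields = split /\t/, $decoded;

binmode STDOUT, ':encoding(UTF-8)';
print "$decoded\n";
print scalar(@fields), " fields\n";
```

Under these assumptions the first field comes out as ad1Ů1ĉtranscript and only three fields are produced, so index 3 (the language code) is undefined.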
Here is the part of the code where the string is split and written to the output file:
for my $infile (@ARGV){
    if (!open (INFILE, "<$infile")){
        die "Couldn't open $infile.\n";
    }
    binmode (OUTFILE, ":utf8");
    while (<INFILE>) {
        chomp;
        $tTot++;
        if ($lineNo == 1) {
            $_ = decode('UCS-2LE', $_);
        }
        else {
            $_ = decode('UCS-2', $_);
        }
        $_ =~ s/[\r\n]+//g;
        my @foo = split('\t');
        my $orth = $foo[0];
        my $tscrpt = $foo[1];
        my $langCode = $foo[3];
        if (exists $codeHash{$langCode}) {
            unless ($tscrpt eq '') {
                check($orth, $tscrpt, $langCode);
            }
        }
        else {
            print OUTFILE "Unknown language code $langCode at line $lineNo.\n";
            print OUTFILE $_; # printing the string that's not being split correctly
            print OUTFILE "\n";
            $tBad++;
        }
    }
The purpose of this script is to check that, for each line in the input file, the language code is valid, and, based on that code, check whether the transcription for each word is "legal" according to our transcription system.
Here's what I've tried so far:
- Changing the encoding of the input strings as they're read in to UTF-8, UTF-16 or UTF-16LE
- Changing the split() character to '\w', /[[:blank:]]/, \p{Blank}, \x{09}, and \N{U+0009}.
- Reading Perl Unicode & perlrebackslash documentation and any other remotely relevant posts I've been able to find on various sites
Does anyone have any suggestions as to other things I might try? Thanks in advance!
I should also mention that I have no control over the input file encoding nor the output file encoding; I have to read in UCS-2LE and output UTF-8.
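Given those constraints, one approach that may be worth trying is to decode at the I/O layer instead of calling decode() on each line, so that the line reader sees characters rather than bytes. The sketch below is self-contained and only illustrative: the temporary file, its contents, and the variable names are hypothetical, and it assumes the data stays within what UTF-16LE can represent (which covers UCS-2LE).

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Write a small UTF-16LE sample file (hypothetical data modelled on the sample line).
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;    # raw bytes, no newline translation
print {$tmp_fh} encode('UTF-16LE',
    "ad\x{131}n\x{131}\ttranscript\tasr\tturkish\r\n");
close $tmp_fh or die "Couldn't close $path: $!";

# Let PerlIO decode the bytes, so readline breaks on real newline characters.
open my $in, '<:raw:encoding(UTF-16LE)', $path
    or die "Couldn't open $path: $!";

my @rows;
while (my $line = <$in>) {
    $line =~ s/\r?\n\z//;        # strip LF or CRLF line endings
    $line =~ s/\A\x{FEFF}//;     # strip a byte order mark, if the file has one
    push @rows, [ split /\t/, $line, -1 ];
}
close $in;

# Encode to UTF-8 on the way out.
binmode STDOUT, ':encoding(UTF-8)';
print join('|', @$_), "\n" for @rows;
```

With the sample data this yields four fields per row, with the Turkish word intact in the first field. In the original script the same idea would mean opening each input file with the encoding layer and dropping the per-line decode() calls.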