Trouble in Perl using split() function on tab chars when string contains non-Latin characters

Question

I'm working on modifying a Perl script that reads in a series of UCS-2LE encoded files with strings in a tab-delimited format, but I am having trouble splitting the strings on the tab character when the string contains characters outside of the extended Latin character set.

Here is a sample line that I'm reading in from these files (tab-delimited):

adını   transcript  asr turkish

When I had my script write these lines to the output file to try and debug this issue, this is what it's writing:

ad1Ů1ĉtranscript    asr turkish

It appears that it's not recognizing the tab character after the Turkish character. This only happens when the word ends with a non-Latin character (and so is adjacent to the tab).

Here is a part of the code block where the writing to the output file happens and string-splitting happens:

for my $infile (@ARGV){  
    if (!open (INFILE, "<$infile")){
        die "Couldn't open $infile.\n";
    }    

binmode (OUTFILE, ":utf8");

while (<INFILE>) {
    chomp;
    $tTot++;

    if ($lineNo == 1) {                
        $_ = decode('UCS-2LE', $_);      
    }
    else {
        $_ = decode('UCS-2', $_);
    }    

    $_ =~ s/[\r\n]+//g;    
    my @foo = split('\t');

    my $orth = $foo[0];
    my $tscrpt = $foo[1];
    my $langCode = $foo[3];

    if (exists $codeHash{$langCode}) {
      unless ($tscrpt eq '') {
        check($orth, $tscrpt, $langCode);
      }
    }
    else {
        print OUTFILE "Unknown language code $langCode at line $lineNo.\n";
        print OUTFILE $_; # printing the string that's not being split correctly
        print OUTFILE "\n";
        $tBad++;
    }
  }

The purpose of this script is to check that, for each line in the input file, the language code is valid, and, based on that code, check whether the transcription for each word is "legal" according to our transcription system.

Here's what I've tried so far:

Changing the encoding of the input strings as they're read in to UTF-8, UTF-16 or UTF-16LE
Changing the split() character to '\w', /[[:blank:]]/, \p{Blank}, \x{09}, and \N{U+0009}.
Reading Perl Unicode & perlrebackslash documentation and any other remotely relevant posts I've been able to find on various sites

Does anyone have any suggestions as to other things I might try? Thanks in advance!

I should also mention that I have no control over the input file encoding nor the output file encoding; I have to read in UCS-2LE and output UTF-8.

You should be able to demonstrate a problem with `split` in under 5 lines. And don't omit the lines handling the encoding of the output in your update. Also, please provide the input for which your upcoming demonstration is failing. `od -t x1 file` will provide it in a format that won't get corrupted. — ikegami, Oct 28 '13 at 19:46
btw, `$_ = decode('UCS-2LE', $_); s/^\x{FEFF}//;` would be a simpler way to decode your file. Even better would be to use `'<:raw:encoding(UCS-2le):crlf'` instead of `'<'` in the `open`. — ikegami, Oct 28 '13 at 19:48
Your `split` to `@foo` seems to be mostly unrelated to your output, with the exception of triggering a couple of error messages. Nothing below the `s` statement seems to have any effect on printing your (unencoded) `$_`. — tjd, Oct 28 '13 at 19:48
btw, UCS-2le is a subset of UTF-16le. It's probably best to decode using UTF-16le in case what you have is actually UTF-16le. — ikegami, Oct 28 '13 at 19:51
@ikegami Thanks for your tips. I'll try opening the file with UTF-16LE. Also, because I'm new at this, could you tell me which parts of the post were not useful, so that next time I can be concise without leaving out the important parts? Thanks! — mariera, Oct 28 '13 at 20:03
@mariera: would help to know how you open the input file also. If you need to muck w/decode() then open in :raw mode. — runrig, Oct 28 '13 at 20:04
@tjd Sorry for not explaining the reason for the split! My bad. I edited my post to explain why I'm splitting the string. When the string doesn't split properly, I get the error message "Unknown language code", and so when that happens, I write the string to the output to see what's going on with it. — mariera, Oct 28 '13 at 20:06
@runrig I added the part of the code above the `while` statement where the input files are iterated through and opened. — mariera, Oct 28 '13 at 20:09
Re "could you tell me which parts of the post were not useful", well, if it's a problem with splitting, you should be showing the code that does the splitting, the data you had before split, what you got after the split, and what you expected after the split. (Data::Dumper with `$Data::Dumper::Useqq = 1;` can be useful.) Now, it could be that it's not a problem with splitting. If so, the above will reveal that and help you locate the actual problem. — ikegami, Oct 28 '13 at 20:14
@ikegami Thank you!!! Data::Dumper was incredibly helpful. I don't have a solution yet, but at least I can see what's actually going on. I didn't know something like Data::Dumper existed (this is my first time working with Perl), so thanks! — mariera, Oct 28 '13 at 20:46
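
The Data::Dumper suggestion from the comments can be sketched roughly as follows. This is a minimal illustration rather than the asker's script: the sample string is modeled on the line in the question, and the variable names are made up for the example. With `$Data::Dumper::Useqq` enabled, tabs and non-ASCII characters are printed as escapes, so it becomes easy to see whether a field really contains a tab.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;    # print tabs and non-ASCII characters as escapes

# Sample line modeled on the question, already decoded into Perl characters
# (\x{131} is the Turkish dotless i).
my $line   = "ad\x{131}n\x{131}\ttranscript\tasr\tturkish";
my @fields = split /\t/, $line;

# Dump the string before the split and the list after it.
my $dump_before = Dumper($line);
my $dump_after  = Dumper(\@fields);
print $dump_before, $dump_after;
```

If the dump of the input string shows no `\t` where one is expected, the problem lies in how the data was read or decoded, not in `split`.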

score 1 · Answer 1 · answered Oct 28 '13 at 19:57

You should start by opening the file with the correct encoding (not that I know whether or not this is the correct one, but I'm taking your word for it). Then you do not need to call decode():

open(my $fh, "<:encoding(UCS-2LE)", $file) or die "Error opening $file: $!";
while (<$fh>) {
  ...
}
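
To make the idea concrete, here is a self-contained sketch that combines this answer with the layer string and BOM removal suggested in the comments. It assumes the input is UTF-16LE with a byte order mark and CRLF line endings; the temporary file, its contents, and the variable names are hypothetical stand-ins for the real input.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small UTF-16LE test file with a BOM and a CRLF line ending
# (hypothetical data modeled on the sample line in the question).
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode($tmp_fh, ':raw:encoding(UTF-16LE)');
print {$tmp_fh} "\x{FEFF}ad\x{131}n\x{131}\ttranscript\tasr\tturkish\r\n";
close($tmp_fh) or die "Error closing $tmp_name: $!";

# Let the I/O layers decode the file; no decode() call is needed.
open(my $in, '<:raw:encoding(UTF-16LE):crlf', $tmp_name)
    or die "Error opening $tmp_name: $!";

my @rows;
while (my $line = <$in>) {
    chomp $line;
    $line =~ s/^\x{FEFF}//;              # drop the byte order mark, if present
    push @rows, [ split /\t/, $line ];   # split on real tab characters
}
close($in);

# Write the fields back out as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');
print join('|', @$_), "\n" for @rows;
```

Because the decoding happens before the line reader looks for newlines, each line arrives as complete characters and `split /\t/` sees the tabs.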

score 0 · Accepted Answer · answered Oct 30 '13 at 17:59

Thanks to everyone's comments and some further research, I figured out how to solve the problem, and it was slightly different from what I thought: it turned out to be a combination of a split() issue and an encoding issue. I had to add the encoding in an explicit open statement instead of using the implicit open in the for loop, and I also had to skip the first two bytes at the beginning of the file.

Here's what the corrected, working code looks like for the section I posted in my question:

for my $infile (@ARGV){
    my $outfile = $infile . '.out';

    # SOLUTION part 1: added explicit open statement
    open (INFILE, "<:raw:encoding(UCS-2le):crlf", $infile) or die "Error opening $infile: $!";

    # SOLUTION part 2: had to skip the first two bytes of the file 
    seek INFILE, 2, 0;

    if (!open (OUTFILE, ">$outfile")) {
        die "Couldn't write to $outfile.\n";
    }

    binmode (OUTFILE, ":utf8");
    print OUTFILE "Line#\tOriginal_Entry\tLangCode\tOffending_Char(s)\n";

    $tBad = 0;
    $tTot = 0;
    $lineNo = 1;

    while (<INFILE>) {
        chomp;
        $tTot++;

        # SOLUTION part 3: deleted the "if" block I had here before that was handling encoding

        # Rest of code in the original block is the same
    }
}

My code now properly recognizes tab characters adjacent to characters not part of the extended Latin set, and splits on tabs as it should.
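
A plausible explanation for the original symptom, consistent with the garbled output in the question: when a UTF-16LE file is read as raw bytes, the line reader cuts after the byte 0x0A, which is only the first half of the two-byte newline (0A 00). The leftover 00 becomes the first byte of the next line and shifts every character by one byte. Decoding that shifted data as big-endian works by accident for characters whose high byte is 00, but not for a character such as ı (U+0131). The sketch below reproduces the shift by hand; the variable names are illustrative, and the trailing odd byte is trimmed to keep the example simple.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The start of the sample line from the question, as Perl characters.
my $text  = "ad\x{131}n\x{131}\ttranscript";
my $bytes = encode('UTF-16LE', $text);

# Simulate the one-byte shift: a stray 00 left over from the previous
# line's newline lands at the front, and the final 00 is dropped so the
# byte count stays even.
my $shifted = "\0" . substr($bytes, 0, -1);

# Decoding the shifted bytes as big-endian pairs each low byte with the
# high byte of the preceding character.
my $garbled = decode('UTF-16BE', $shifted);

binmode(STDOUT, ':encoding(UTF-8)');
print "$garbled\n";
print "contains a tab: ", ($garbled =~ /\t/ ? "yes" : "no"), "\n";
```

The result is `ad1Ů1ĉtranscript`, the same string shown in the question: the 01 high byte of each ı attaches to the following `n` and to the tab, so the tab turns into ĉ (U+0109) and there is nothing left for split() to match.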

NOTE: Another solution would have been to enclose the foreign words in double quotes, but, in our case, we couldn't guarantee that our input files would be formatted that way.

Thanks to everyone who commented and helped me out!
