I have a lot of files that have Cyrillic filenames, for example Deceasedя0я0.25я3.xgboost.json

I read these files in with a function:

use Devel::Confess 'color';
use utf8;
use autodie ':all';
use open ':std', ':encoding(UTF-8)';

sub json_file_to_ref {
    my $json_filename = shift;
    open my $fh, '<:raw', $json_filename; # Read it unmangled
    local $/;                     # Read whole file
    my $json = <$fh>;             # This is UTF-8
    my $ref = decode_json($json); # This produces decoded text
    return $ref;                  # Return the ref rather than the keys and values.
}

which I got from "perl & python writing out non-ASCII characters into JSON differently".

The problem is that Perl reads the file names as DeceasedÑ0Ñ0.2Ñ3.xgboost.json, i.e. it turns я into Ñ, so the files don't show up when I do a regex search.
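
For context, here is a minimal sketch of where the Ñ comes from, assuming the file system stores names as UTF-8 (the byte values match the \x{d1}\x{8f} escapes in the error output further down). The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The single character я (U+044F) as its two UTF-8 bytes.
my $bytes = "\xD1\x8F";

# Decoding the bytes as UTF-8 recovers the one intended character.
my $text = decode('UTF-8', $bytes);
printf "decoded: %d character(s), first code point U+%04X\n",
    length($text), ord($text);

# Left undecoded, Perl treats each byte as its own character:
# U+00D1 (which displays as Ñ) followed by the control character U+008F.
printf "raw:     %d character(s): %s\n",
    length($bytes),
    join(' ', map { sprintf 'U+%04X', ord($_) } split //, $bytes);
```

So a regex containing a decoded я cannot match the undecoded name, because the name holds two unrelated characters in its place.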

the file names are read thus:

sub list_regex_files {
    my $regex = shift;
    my $directory = '.';
    if (defined $_[0]) {
        $directory = shift
    }
    my @files;
    opendir (my $dh, $directory);
    $regex = qr/$regex/;
    while (my $file = readdir $dh) {
        if ($file !~ $regex) {
            next
        }
        if ($file =~ m/^\.{1,2}$/) {
            next
        }
        my $f = "$directory/$file";
        if (-f $f) {
            if ($directory eq '.') {
                push @files, $file
            } else {
                push @files, $f
            }
        }
    }
    @files
}

However, I can get the files to show up with the regex search if I comment out

use utf8;
use open ':std', ':encoding(UTF-8)';

but then when I try to read the files in, I get the following error (shown here for a different file):

Use of uninitialized value $/ in string eq at /home/con/perl5/perlbrew/perls/perl-5.32.1/lib/5.32.1/Carp.pm line 605, <$_[...]> chunk 1.
Use of uninitialized value $/ in string eq at /home/con/perl5/perlbrew/perls/perl-5.32.1/lib/5.32.1/Carp.pm line 605, <$_[...]> chunk 1.
Use of uninitialized value $/ in string eq at 4.best.params.pl line 32, <$_[...]> chunk 1.
    main::json_file_to_ref("data/Deceased\x{d1}\x{8f}0\x{d1}\x{8f}0.15\x{d1}\x{8f}3.xgboost.json") called at 4.best.params.pl line 140

I've seen similar posts like "How do I write a file whose *filename* contains utf8 characters in Perl?" and "Perl newbie first experience with Unicode (in filename, -e operator, open operator, and cmd window)", but I'm not using Windows.

I've also tried use feature 'unicode_strings' to no avail.

I've also tried

use Encode 'decode_utf8';
sub json_file_to_ref {
    my $json_filename = shift;
    open my $fh, '<:raw', decode_utf8($json_filename); # Read it unmangled
    local $/;                     # Read whole file
    my $json = <$fh>;             # This is UTF-8
    my $ref = decode_json($json); # This produces decoded text
    return $ref;                  # Return the ref rather than the keys and values.
}

but this produces the same error message.

I've also tried

use open ':std', ':encoding(cp866)';
use open IO => ':encoding(cp1251)';

as suggested in Reading Cyrillic characters from file in perl

but this too fails.

How can I get Linux Perl to read the filenames as they're written through that subroutine?

con
  • Where is the code that reads the filename? – stark Mar 11 '21 at 15:27
  • @stark I've edited the question to include how filenames are read – con Mar 11 '21 at 15:29
  • readdir returns a byte string, which is the form that must be used for opening a file. To compare it to a text string you should make a decoded copy for use with your regex. See https://www.perlmonks.org/?node_id=583736 – stark Mar 11 '21 at 15:40
  • @stark `my $file = decode('utf8', readdir $dh)` doesn't work :( – con Mar 11 '21 at 16:47
  • Correct. The decoded string can't be used as a filename. – stark Mar 11 '21 at 17:51
  • If you 'use utf8;', does `print decode('utf8', $filename);` at least print the file name with Cyrillic characters? I think the file name as read from `readdir()` is what you need to use to pass to `open()` and other file-based functions. I think you should `decode()` the filename to `'utf8'` before trying to match it to your regex. Basically, I think you need to have two different file name strings, the one you get from `readdir()` and the `'utf8'` version and use the appropriate one depending on what you're doing. See https://www.perlmonks.org/?node_id=583752 – Ed Sabol Mar 12 '21 at 04:52
  • See also [In what encoding does readdir return a filename?](https://stackoverflow.com/q/37027051/2173773) – Håkon Hægland Mar 12 '21 at 23:03

1 Answer

As @Ed Sabol pointed out, the problem was the encoding of the file names and how they were being read: readdir returns the names as undecoded bytes.

The key change is to replace readdir $dh with decode_utf8(readdir $dh), which lets Perl handle the non-Latin (Cyrillic) filenames. The Encode module must also be loaded: use Encode 'decode_utf8';

#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';
use autodie ':all';
use Devel::Confess 'color';
use feature 'say';
use JSON 'decode_json';
use utf8;
use DDP;
use Encode 'decode_utf8'; # necessary for Cyrillic characters
use open ':std', ':encoding(UTF-8)';    # For say to STDOUT.  Also default for open()

sub json_file_to_ref {
    my $json_filename = shift;
    open my $fh, '<:raw', $json_filename; # Read it unmangled
    local $/;                     # Read whole file
    my $json = <$fh>;             # This is UTF-8
    my $ref = decode_json($json); # This produces decoded text
    return $ref;                  # Return the ref rather than the keys and values.
}

sub list_regex_files {
    my $regex = shift;
    my $directory = '.';
    if (defined $_[0]) {
        $directory = shift
    }
    my @files;
    opendir (my $dh, $directory);
    $regex = qr/$regex/;
    while (my $file = decode_utf8(readdir $dh)) {
        if ($file !~ $regex) {
            next
        }
        if ($file =~ m/^\.{1,2}$/) {
            next
        }
        my $f = "$directory/$file";
        if (-f $f) {
            if ($directory eq '.') {
                push @files, $file
            } else {
                push @files, $f
            }
        }
    }
    @files
}
my @files = list_regex_files('я\.json$'); # escape the dot so it matches literally
p @files;

my $data = json_file_to_ref('я.json');
p $data;
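
The comments under the question suggest a more defensive variant: keep the byte string that readdir returns for open and -f, and decode only a copy for the regex match. The following is a minimal sketch of that idea, not a drop-in replacement. It assumes file names are stored as UTF-8 and that $directory is itself an undecoded byte string (or plain ASCII); the name list_regex_files_bytes is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical variant of list_regex_files: match against decoded text,
# but return the undecoded bytes exactly as readdir produced them.
sub list_regex_files_bytes {
    my ($regex, $directory) = @_;
    $directory //= '.';
    $regex = qr/$regex/;

    opendir(my $dh, $directory) or die "Cannot open directory: $!";
    my @files;
    # defined() keeps the loop going even for a file that is named "0".
    while (defined(my $bytes = readdir $dh)) {
        next if $bytes eq '.' or $bytes eq '..';

        # Assumption: names are UTF-8. The decoded copy is used only
        # for matching and never reaches the file system.
        my $text = decode('UTF-8', $bytes);
        next unless $text =~ $regex;

        my $path = $directory eq '.' ? $bytes : "$directory/$bytes";
        push @files, $path if -f $path;
    }
    closedir $dh;
    return @files;
}
```

The returned paths can be passed straight to json_file_to_ref. Because they are bytes, they should be decoded again before being printed through a handle that has a :encoding(UTF-8) layer, otherwise they would be encoded a second time.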

As an aside, with Perl 7 coming out soon, correct handling of non-Latin characters seems like a sensible default to adopt.

con
  • The core issue is that a file system can store a file name *any way it wants*. It would be nice otherwise, but Perl has no way of knowing this. UTF-8, UTF-16, EBCDIC, who knows. The file system drive can likewise accept who knows what. Welcome to interoperability. – lordadmira Mar 13 '21 at 00:14
  • You can always use the core `B` module function `B::perlstring()` to have Perl spit back to you what is actually in your string to see if it comports with what you think. You can't just `print()` because your terminal will likely interpret raw bytes and show you what *looks like* valid unicode text. `perlstring()` shows an unambiguous representation using escapes for non-ASCII. – lordadmira Mar 13 '21 at 00:18
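
The inspection trick from the last comment can be sketched as follows. This is only an illustration: the two variables stand in for a name as readdir might return it and for its decoded counterpart, assuming UTF-8 file names.

```perl
use strict;
use warnings;
use B ();
use Encode qw(decode);

my $bytes = "\xD1\x8F";               # UTF-8 bytes of я, as readdir might return them
my $text  = decode('UTF-8', $bytes);  # the same name as one decoded character

# perlstring() returns a quoted Perl literal with non-ASCII content
# written as escapes, so the output does not depend on how the
# terminal chooses to render raw bytes.
print 'bytes: ', B::perlstring($bytes), "\n";
print 'text:  ', B::perlstring($text),  "\n";
```

If the two lines print different literals for what looks like the same name on screen, one string is still bytes and the other is decoded text, which is exactly the mismatch that made the regex fail.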