I have a lot of files that have Cyrillic filenames, for example Deceasedя0я0.25я3.xgboost.json
I read these files in with a function:
use Devel::Confess 'color';
use utf8;
use autodie ':all';
use open ':std', ':encoding(UTF-8)';
use JSON 'decode_json'; # or whichever module provides decode_json

sub json_file_to_ref {
    my $json_filename = shift;
    open my $fh, '<:raw', $json_filename; # Read the raw bytes, unmangled
    local $/;                             # Slurp the whole file
    my $json = <$fh>;                     # This is UTF-8 encoded JSON
    my $ref  = decode_json($json);        # This produces decoded text
    return $ref;                          # Return the ref rather than the keys and values.
}
which I got from the question "perl & python writing out non-ASCII characters into JSON differently".
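For context, this is roughly how the function gets called (the path is one of the real files, taken from the error trace further down):

my $ref = json_file_to_ref('data/Deceasedя0я0.15я3.xgboost.json');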
The problem is that Perl reads the file names back as DeceasedÑ0Ñ0.2Ñ3.xgboost.json, i.e. every я is turned into Ñ, which means the files won't show up when I do a regex search.
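If it helps, this looks like the classic UTF-8-bytes-read-as-Latin-1 pattern: я is U+044F, which encodes to the two bytes D1 8F in UTF-8, and byte D1 on its own displays as Ñ. A minimal check of that (just a sketch to show the byte pattern, not part of my script):

use Encode 'encode';
# я (U+044F) encodes to the two bytes D1 8F in UTF-8;
# seen as Latin-1, D1 is Ñ and 8F is an unprintable control character.
printf "%v02X\n", encode('UTF-8', "\x{044F}"); # prints D1.8F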
The file names themselves come from this subroutine:
sub list_regex_files {
    my $regex     = shift;
    my $directory = '.';
    if (defined $_[0]) {
        $directory = shift;
    }
    my @files;
    opendir (my $dh, $directory); # autodie reports failures here
    $regex = qr/$regex/;
    while (my $file = readdir $dh) {
        if ($file !~ $regex) {
            next;
        }
        if ($file =~ m/^\.{1,2}$/) { # skip . and ..
            next;
        }
        my $f = "$directory/$file";
        if (-f $f) {
            if ($directory eq '.') {
                push @files, $file;
            } else {
                push @files, $f;
            }
        }
    }
    closedir $dh;
    @files;
}
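This is roughly how the two pieces are used together (a hypothetical call; the exact pattern varies from script to script):

# With use utf8 in effect, the я in the pattern is a character (U+044F),
# but readdir hands back raw bytes, so nothing matches.
my @files = list_regex_files('Deceasedя.*xgboost\.json$', 'data');
foreach my $file (@files) {
    my $ref = json_file_to_ref($file);
}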
However, I can get the files to show up with the regex search if I comment out
use utf8;
use open ':std', ':encoding(UTF-8)';
but then when I try to read the files in, I get errors like this (the following error is for a different file):
Use of uninitialized value $/ in string eq at /home/con/perl5/perlbrew/perls/perl-5.32.1/lib/5.32.1/Carp.pm line 605, <$_[...]> chunk 1.
Use of uninitialized value $/ in string eq at /home/con/perl5/perlbrew/perls/perl-5.32.1/lib/5.32.1/Carp.pm line 605, <$_[...]> chunk 1.
Use of uninitialized value $/ in string eq at 4.best.params.pl line 32, <$_[...]> chunk 1.
main::json_file_to_ref("data/Deceased\x{d1}\x{8f}0\x{d1}\x{8f}0.15\x{d1}\x{8f}3.xgboost.json") called at 4.best.params.pl line 140
I've seen similar posts like "How do I write a file whose *filename* contains utf8 characters in Perl?" and "Perl newbie first experience with Unicode (in filename, -e operator, open operator, and cmd window)", but I'm not using Windows.
I've also tried use feature 'unicode_strings'
to no avail.
I've also tried
use Encode 'decode_utf8';

sub json_file_to_ref {
    my $json_filename = shift;
    open my $fh, '<:raw', decode_utf8($json_filename); # Decode the filename before opening
    local $/;                                          # Slurp the whole file
    my $json = <$fh>;                                  # This is UTF-8
    my $ref  = decode_json($json);                     # This produces decoded text
    return $ref;                                       # Return the ref rather than the keys and values.
}
but this produces the same error message.
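As far as I can tell the decoding step itself is fine; decoding the byte string from the error trace above by hand does give back the Cyrillic name (sketch, not part of my script):

use utf8;
use Encode 'decode_utf8';
# the bytes D1 8F decode to я, so the name round-trips as expected
my $name = decode_utf8("Deceased\x{d1}\x{8f}0\x{d1}\x{8f}0.15\x{d1}\x{8f}3.xgboost.json");
print $name =~ /я/ ? "matches\n" : "no match\n"; # prints "matches"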
I've also tried
use open ':std', ':encoding(cp866)';
use open IO => ':encoding(cp1251)';
as suggested in "Reading Cyrillic characters from file in perl", but this too fails.
How can I get Perl on Linux to read these filenames as they are actually written, through that subroutine?
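For what it's worth, the direction I've been experimenting with (I'm not sure it's the right approach, or whether it just moves the problem elsewhere) is to decode what readdir returns only for the regex match, and keep the raw bytes for anything that touches the filesystem; a rough sketch:

use Encode 'decode';

sub list_regex_files_decoded { # hypothetical variant, not my real code
    my ($regex, $directory) = @_;
    $directory //= '.';
    $regex = qr/$regex/;
    opendir my $dh, $directory or die "opendir $directory: $!";
    my @files;
    while (my $file = readdir $dh) {
        next if $file =~ /^\.{1,2}$/;
        my $name = decode('UTF-8', $file); # readdir gives bytes; match on characters
        next if $name !~ $regex;
        my $f = "$directory/$file";        # keep the raw bytes for -f and open
        push @files, ($directory eq '.' ? $file : $f) if -f $f;
    }
    closedir $dh;
    return @files;
}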