Note   Code to handle the same filename found in different directories is added below.
The files to copy need to be found first, since they aren't given with a path (we don't know which directories they are in), but searching the tree anew for each one is extremely wasteful and would greatly increase the complexity.
Instead, build a hash with a full-path name for each filename first.
One way, in Perl, using the fast core module File::Find:
use warnings;
use strict;
use feature 'say';
use File::Find;
use File::Copy qw(copy);
my $source_dir = shift // '/path/to/source'; # give at invocation or default
my $copy_to_dir = '/path/to/destination';
my $file_list = 'file_list_to_copy.txt';
open my $fh, '<', $file_list or die "Can't open $file_list: $!";
my @files = <$fh>;
chomp @files;
my %fqn;
find( sub { $fqn{$_} = $File::Find::name unless -d }, $source_dir );
# Now copy the ones from the list to the given location
foreach my $fname (@files) {
    copy $fqn{$fname}, $copy_to_dir
        or do {
            warn "Can't copy $fqn{$fname} to $copy_to_dir: $!";
            next;
        };
}
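One possible refinement, not part of the code above: a name from the list may not be found under the source directory at all, in which case $fqn{$fname} is undef and copy fails with a less specific message. A minimal sketch of reporting that case explicitly:

foreach my $fname (@files) {
    # Report names from the list that were never found in the scan
    if (not exists $fqn{$fname}) {
        warn "No file '$fname' found under $source_dir";
        next;
    }
    copy $fqn{$fname}, $copy_to_dir
        or do {
            warn "Can't copy $fqn{$fname} to $copy_to_dir: $!";
            next;
        };
}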
The remaining problem is filenames that may exist in multiple directories, but we need to be given a rule for what to do then.†
I disregard the maximal depth used in the question, since it is unexplained and seemed to me to be a workaround for extreme runtimes (?). Also, files are copied into a "flat" structure (without restoring their original hierarchy), taking the cue from the question.
Finally, I skip only directories, while various other file types come with their own issues (copying links around needs care). To accept only plain files, change `unless -d` to `if -f`.
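That variant of the scan would look like this:

# Same scan, but accepting only plain files (skips links, fifos, etc.)
find( sub { $fqn{$_} = $File::Find::name if -f }, $source_dir );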
† A clarification came that, indeed, there may be files with the same name in different directories. Those should be copied to the same name, suffixed with a sequential number before the extension.
For this we need to check whether a name already exists, and to keep track of the duplicates, while building the hash, so this part will take a little longer. There is also the small conundrum of how to account for the duplicate names. I use another hash where only duped names‡ are kept, in arrayrefs; this simplifies and speeds up both parts of the job.
my (%fqn, %dupe_names);

find( sub {
    return if -d;
    (exists $fqn{$_})
        ? push( @{ $dupe_names{$_} }, $File::Find::name )
        : ( $fqn{$_} = $File::Find::name );
}, $source_dir );
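For illustration, with hypothetical paths: if data.txt appears three times under the source directory, the first full path found stays in %fqn and the remaining ones go into %dupe_names:

# Hypothetical contents after the scan (paths made up for the example)
# %fqn        = ( 'data.txt' => '/src/a/data.txt', 'notes.md' => '/src/a/notes.md' );
# %dupe_names = ( 'data.txt' => [ '/src/b/data.txt', '/src/c/data.txt' ] );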
To my surprise this runs barely any slower than the code with no concern for duplicate names, on a quarter million files spread over a sprawling hierarchy, even though a test now runs for each item.
The parens around the assignment in the ternary operator are needed because the operator itself may be assigned to (when the last two arguments are valid "lvalues," as they are here), so one needs to be careful with assignments inside the branches.
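A self-contained illustration of that pitfall (adapted from the cautionary example in perlop), using nothing beyond core Perl:

use strict;
use warnings;

my $n = 3;

# Without parens a trailing assignment binds to the whole ternary,
# since ?: binds tighter than assignment and is itself assignable here
$n % 2 ? $n += 10 : $n += 2;   # parsed as (($n % 2) ? ($n += 10) : $n) += 2

print "$n\n";   # prints 15, not the 13 one might expect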
Then, after copying `%fqn` as in the main part of the post, we also copy the other files with the same name. We need to break up the filenames so as to add the enumeration before `.ext`; I use the core File::Basename:
use File::Basename qw(fileparse);

foreach my $fname (@files) {
    next if not exists $dupe_names{$fname};  # no dupe (and copied already)
    my $cnt = 1;
    foreach my $fqn (@{$dupe_names{$fname}}) {
        my ($name, $path, $ext) = fileparse($fqn, qr/\.[^.]*/);
        copy $fqn, "$copy_to_dir/${name}_$cnt$ext"
            or do {
                warn "Can't copy $fqn to $copy_to_dir: $!";
                next;
            };
        ++$cnt;
    }
}
(basic testing done but not much more)
I'd perhaps use `undef` instead of `$path` above, to indicate that the path is unused (that also avoids allocating and populating a scalar), but I left it this way for clarity for those unfamiliar with what the module's sub returns.
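For reference, this is what `fileparse` returns with the suffix pattern used above, and how `undef` can take the place of the unused path (the path here is a made-up example):

use File::Basename qw(fileparse);

my ($name, $path, $ext) = fileparse('/src/dir/report.txt', qr/\.[^.]*/);
# $name eq 'report', $path eq '/src/dir/', $ext eq '.txt'

# Discard the path in the list assignment if it isn't needed
my ($base, undef, $suffix) = fileparse('/src/dir/report.txt', qr/\.[^.]*/);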
Note. For files with duplicates there'll be copies `fname.ext`, `fname_1.ext`, etc. If you'd rather have them all indexed, first rename `fname.ext` (in the destination, where it has already been copied via `%fqn`) to `fname_1.ext`, and change the counter initialization to `my $cnt = 2;`.
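A minimal sketch of that variant, reusing `@files`, `%dupe_names`, and `$copy_to_dir` from above; the already-copied file is renamed with `move` from the same core File::Copy module:

use File::Copy qw(copy move);
use File::Basename qw(fileparse);

foreach my $fname (@files) {
    next if not exists $dupe_names{$fname};

    my ($name, undef, $ext) = fileparse($fname, qr/\.[^.]*/);

    # The copy already made via %fqn gets index 1
    move("$copy_to_dir/$fname", "$copy_to_dir/${name}_1$ext")
        or warn "Can't rename $copy_to_dir/$fname: $!";

    my $cnt = 2;   # further duplicates continue the numbering
    foreach my $fqn (@{ $dupe_names{$fname} }) {
        copy $fqn, "$copy_to_dir/${name}_$cnt$ext"
            or do {
                warn "Can't copy $fqn to $copy_to_dir: $!";
                next;
            };
        ++$cnt;
    }
}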
‡ Note that these by no means need be the same files.