4

I am on Linux and I am trying to find thousands of files in a directory (SOURCE_DIR) that contains millions of files. I have a list of the file names that I need to find, stored in a single text file (FILE_LIST). Each line of this file contains a single name corresponding to a file in SOURCE_DIR, and there are thousands of lines in the file.

## FILE_LIST contains single-word file names, one per line
#Name0001
#Name0002
#..
#Name9999

I want to copy the files to another directory (DESTINATION_DIR). I wrote the loop below, with a nested loop inside to find the files one by one.

#!/bin/bash
FILE_LIST='file.list'
## FILE_LIST contains single-word file names, one per line
#Name0001
#Name0002
#..
#Name9999

SOURCE_DIR='/path/to/source/files' # Contains millions of files in sub-directories
DESTINATION_DIR='/path/to/destination/files' # Files will be copied here


while read -r FILE_NAME
do
    echo "$FILE_NAME"
    for FILE_NAME_WITH_PATH in `find "$SOURCE_DIR" -maxdepth 3 -name "$FILE_NAME*" -type f -exec readlink -f {} \;`;
    do
        echo "$FILE_NAME_WITH_PATH"
        cp -pv "$FILE_NAME_WITH_PATH" "$DESTINATION_DIR";
    done
done < "$FILE_LIST"

This loop is taking a lot of time and I was wondering whether there is a better way to achieve my goal. I searched, but did not find a solution to my problem. Please direct me to a solution if one already exists, or kindly suggest any tweak to the above code. I am also fine with another approach, or even a Python/Perl solution. Thanks for your time and help!

Insilico
  • What is the use of `readlink` here? – oguz ismail May 16 '20 at 20:28
  • readlink was used to get the full path. – Insilico May 16 '20 at 20:36
  • This [Python multiprocess/multithreading to speed up file copying](https://stackoverflow.com/questions/44320331/python-multiprocess-multithreading-to-speed-up-file-copying) may be useful. Spencer's answer, for instance, claims an 8x improvement on his/her system but warns mileage may vary. – DarrylG May 16 '20 at 20:37
  • Huh? Why do you need to find them if you know their names? Just copy them by name, in parallel with **GNU Parallel** if you have decent disks. – Mark Setchell May 16 '20 at 20:37
  • @DarrylG, Thanks for the python suggestion. I will look into it – Insilico May 16 '20 at 20:41
  • @MarkSetchell, the files are in different sub-directories. I do not know which ones, so I have to find them. – Insilico May 16 '20 at 20:42
  • Oh, your question implies they are all in one directory to my mind. – Mark Setchell May 16 '20 at 20:49
  • Why do you want to do this? What do you plan to do next? Are the files large - if so, you could just make symlinks to the files rather than duplicate the (potentially voluminous) content. – Mark Setchell May 16 '20 at 20:55
  • A problem: what to do if there is the same filename in different directories? – zdim May 16 '20 at 20:58
  • Finally, the Perl solution by zdim was the fastest. I also figured out a bash one-liner which worked fine, based on the 'grep' suggestions here: `find $SOURCE_DIR -type f -print0 | grep -zFf $FILE_LIST | xargs -0 -I {} cp {} --backup=t $DESTINATION_DIR` (real 2m21.254s, user 0m9.732s, sys 0m37.473s) – Insilico May 18 '20 at 08:11
  • Renamed the files to keep the original file extension with `rename 's/((?:\..+)?)\.~(\d+)~$/_$2$1/' *.~*~` (Ubuntu 16.04) – Insilico May 18 '20 at 08:24

5 Answers

5

Note: Code to handle same names in different directories has been added below.


The files to copy need to be found, as they aren't given with a path (we don't know which directories they are in), but searching anew for each one is extremely wasteful, increasing complexity greatly.

Instead, build a hash with a full-path name for each filename first.

One way, with Perl, utilizing the fast core module File::Find

use warnings;
use strict;
use feature 'say';

use File::Find;
use File::Copy qw(copy);

my $source_dir = shift // '/path/to/source';  # give at invocation or default

my $copy_to_dir = '/path/to/destination';

my $file_list = 'file_list_to_copy.txt';  
open my $fh, '<', $file_list or die "Can't open $file_list: $!";
my @files = <$fh>;
chomp @files;


my %fqn;    
find( sub { $fqn{$_} = $File::Find::name  unless -d }, $source_dir );

# Now copy the ones from the list to the given location        
foreach my $fname (@files) { 
    copy $fqn{$fname}, $copy_to_dir  
        or do { 
            warn "Can't copy $fqn{$fname} to $copy_to_dir: $!";
            next;
        };
}

The remaining problem is about filenames that may exist in multiple directories, but then we need to be given a rule for what to do.

I disregard the maximal depth used in the question, since it is unexplained and seemed to me to be a fix related to the extreme runtimes (?). Also, files are copied into a "flat" structure (without restoring their original hierarchy), taking the cue from the question.

Finally, I skip only directories, while various other file types come with their own issues (copying links around needs care). To accept only plain files change unless -d to if -f.


A clarification came that, indeed, there may be files with the same name in different directories. Those should be copied to the same name, suffixed with a sequential number before the extension.

For this we need to check whether a name already exists, and to keep track of duplicate ones, while building the hash, so this will take a little longer. There is a little conundrum of how to account for duplicate names; I use another hash where only duplicated names are kept, with their full paths in arrayrefs; this simplifies and speeds up both parts of the job.

my (%fqn, %dupe_names);
find( sub {
    return if -d;
    (exists $fqn{$_})
        ? push( @{ $dupe_names{$_} }, $File::Find::name )
        : ( $fqn{$_} = $File::Find::name );
}, $source_dir );

To my surprise this runs barely any slower than the code with no concern for duplicate names, on a quarter million files spread over a sprawling hierarchy, even though a test now runs for each item.

The parens around the assignment in the ternary operator are needed since the operator may be assigned to (if the last two arguments are valid "lvalues," as they are here), so one needs to be careful with assignments inside the branches.
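
A minimal illustration of that pitfall, adapted from the example in perlop:

use strict;
use warnings;

my $n = 3;

# Intended as "add 10 if odd, otherwise add 2", but without parentheses this
# parses as (($n % 2) ? ($n += 10) : $n) += 2, because ?: binds more tightly
# than assignment and its result is a valid lvalue
$n % 2 ? $n += 10 : $n += 2;

print "$n\n";    # prints 15, not the 13 one might expect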

Then, after copying %fqn as in the main part of the post, also copy the other files with the same name. We need to break up the filenames so as to add the enumeration before .ext; I use the core File::Basename

use File::Basename qw(fileparse);

foreach my $fname (@files) { 
    next if not exists $dupe_names{$fname};  # no dupe (and copied already)
    my $cnt = 1;
    foreach my $fqn (@{$dupe_names{$fname}}) { 
        my ($name, $path, $ext) = fileparse($fqn, qr/\.[^.]*/); 
        copy $fqn, "$copy_to_dir/${name}_$cnt$ext"
            or do { 
                warn "Can't copy $fqn to $copy_to_dir: $!";
                next;
            };
        ++$cnt;
    }
}

(basic testing done but not much more)

I'd perhaps use undef instead of $path above, to indicate that the path is unused (while that also avoids allocating and populating a scalar), but I left it this way for clarity for those unfamiliar with what the module's sub returns.
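
That alternative would look like, say,

my ($name, undef, $ext) = fileparse($fqn, qr/\.[^.]*/);

with the undef simply discarding the path element of the returned list.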

Note.   For files with duplicates there'll be copies fname.ext, fname_1.ext, etc. If you'd rather have them all indexed, then first rename fname.ext (in the destination, where it has already been copied via %fqn) to fname_1.ext, and change counter initialization to my $cnt = 2;.
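
A sketch of that tweak, assuming it replaces the my $cnt = 1; line inside the loop above (it reuses $fname, $copy_to_dir, and fileparse from the snippets in this answer):

use File::Copy qw(move);

# rename the copy already made via %fqn so that it, too, carries an index
my ($name, undef, $ext) = fileparse($fname, qr/\.[^.]*/);
move "$copy_to_dir/$fname", "$copy_to_dir/${name}_1$ext"
    or warn "Can't rename $copy_to_dir/$fname: $!";

my $cnt = 2;    # subsequent duplicates then become _2, _3, ...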


Note that these by no means need to be the same files.

zdim
  • Thank you very much! This works. Took just over 1 min! As you suspected, I do have a problem with duplicate files. Could you also suggest a solution to rename files already copied, by suffixing a sequential number before the file extension? A flat structure of the file path without hierarchy is ok for me. I also figured out a bash one-liner (posted below) which took more than 2 mins. – Insilico May 18 '20 at 08:08
  • @Insilico Added code to handle duplicates. (Also added a condition to skip directories, which I presume should be done.) To my surprise, this doesn't actually run much slower at all. Note that with so many files this can be sped up a lot by using multiple processes; but then the code is _much_ more complex. – zdim May 19 '20 at 07:00
  • @Insilico How did this go? It worked in my tests but if there are issues please let me know – zdim May 28 '20 at 21:05
  • Thank you very much for your revision! I apologize for my delayed response. We are working on a COVID-19 drug discovery research project and things are quite hectic. The millions of files I mentioned in my question are digital representations of chemical compounds, from which we are looking for a few that can destroy the virus but are safe for humans. Your help is much appreciated. I will get back to your revised code, but the old code was useful even with the limitations. Thanks again! – Insilico Jun 26 '20 at 08:06
  • @Insilico Wow -- best of luck with your project! (I think I can say that I speak for _many_ here :) Thank you for the explanation :). As always, please let me know if questions/issues pop up (but now even more so) – zdim Jun 28 '20 at 06:48
2

I suspect the speed issues are (at least partly) coming from your nested loops: for every FILE_NAME, you're running a find and looping over its results. The following Perl solution uses the technique of dynamically building a regular expression (which works for large lists; I've tested it on lists of 100k+ words to match). That way you only need to loop over the files once and let the regular expression engine handle the rest; it's quite fast.

Note I have made a couple of assumptions based on my reading of your script: That you want the patterns to match case-sensitively at the beginning of filenames, and that you want to recreate the same directory structure as the source in the destination (set $KEEP_DIR_STRUCT=0 if you do not want this). Also, I am using the not-exactly-best-practice solution of shelling out to find instead of using Perl's own File::Find because it makes it easier to implement the same options you're using (such as -maxdepth 3) - but it should work fine unless there are any files with newlines in their name.

This script uses only core modules so you should already have them installed.

#!/usr/bin/env perl
use warnings;
use strict;
use File::Basename qw/fileparse/;
use File::Spec::Functions qw/catfile abs2rel/;
use File::Path qw/make_path/;
use File::Copy qw/copy/;

# user settings
my $FILE_LIST='file.list';
my $SOURCE_DIR='/tmp/source';
my $DESTINATION_DIR='/tmp/dest';
my $KEEP_DIR_STRUCT=1;
my $DEBUG=1;

# read the file list
open my $fh, '<', $FILE_LIST or die "$FILE_LIST: $!";
chomp( my @files = <$fh> );
close $fh;

# build a regular expression from the list of filenames
# explained at: https://www.perlmonks.org/?node_id=1179840
my ($regex) = map { qr/^(?:$_)/ } join '|', map {quotemeta}
    sort { length $b <=> length $a or $a cmp $b } @files;

# prep dest dir
make_path($DESTINATION_DIR, { verbose => $DEBUG } );

# use external "find"
my @cmd = ('find',$SOURCE_DIR,qw{ -maxdepth 3 -type f -exec readlink -f {} ; });
open my $cmd, '-|', @cmd or die $!;
while ( my $srcfile = <$cmd> ) {
    chomp($srcfile);
    my $basename = fileparse($srcfile);
    # only interested in files that match the pattern
    next unless $basename =~ /$regex/;
    my $newname;
    if ($KEEP_DIR_STRUCT) {
        # get filename relative to the source directory
        my $relname = abs2rel $srcfile, $SOURCE_DIR;
        # build new filename in destination directory
        $newname = catfile $DESTINATION_DIR, $relname;
        # create the directories in the destination (if necessary)
        my (undef, $dirs) = fileparse($newname);
        make_path($dirs, { verbose => $DEBUG } );
    }
    else {
        # flatten the directory structure
        $newname = catfile $DESTINATION_DIR, $basename;
        # warn about potential naming conflicts
        warn "overwriting $newname with $srcfile\n" if -e $newname;
    }
    # copy the file
    print STDERR "cp $srcfile $newname\n" if $DEBUG;
    copy($srcfile, $newname) or die "copy('$srcfile', '$newname'): $!";
}
close $cmd or die "external command failed: ".($!||$?);

You may also want to consider using hard links instead of copying the files.
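
For instance, a minimal sketch of that variant, assuming source and destination live on the same filesystem (hard links cannot cross filesystems), swaps the copy() call in the loop above for Perl's built-in link, with a fallback to copying:

# hard-link instead of copying; fall back to copy() if linking fails
link($srcfile, $newname)
    or copy($srcfile, $newname)
    or die "link/copy('$srcfile', '$newname'): $!";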

haukex
  • You are absolutely right. The nested loop was a bad idea. Thanks for sharing your thoughts. Also, your suggestion on hard links is quite useful. – Insilico May 18 '20 at 08:16
1

Here is a bash 4+ solution with find; not sure about the speed though.

#!/usr/bin/env bash

files=file.list
sourcedir=/path/to/source/files
destination=/path/to/destination/files
mapfile -t lists < "$files"
total=${#lists[*]}

while IFS= read -rd '' files; do
  counter=0
  while ((counter < total)); do
    if [[ $files == *"${lists[counter]}" ]]; then
      echo cp -v "$files" "$destination" && unset 'lists[counter]' && break
    fi
    ((counter++))
  done
  lists=("${lists[@]}")
  total=${#lists[*]}
  (( ! total )) && break  ##: if the lists array is already empty, break.
done < <(find "$sourcedir" -type f -print0)
  • The inner break exits the inner loop once a match is found between an entry in the file.list and a file in the source directory, so it does not have to scan the file.list to the end; the unset then removes that entry from the "${lists[@]}" array, so the next run of the inner loop skips the already matched files.

  • File name collisions should not be a problem; the unset and the inner break make sure of that. The downside is if you have multiple files to match in different sub-directories.

  • If speed is what you're looking for, then use general scripting languages like Python, Perl, and friends.


An alternative to the (excruciatingly slow) pattern match inside the loop is grep:

#!/usr/bin/env bash

files=file.list
source_dir=/path/to/source/files
destination_dir=/path/to/destination/files

while IFS= read -rd '' file; do
  cp -v "$file" "$destination_dir"
done < <(find "$source_dir" -type f -print0 | grep -Fzwf "$files")
  • The -z option of grep is a GNU extension.

  • Remove the echo if you think the output is correct.

Jetchisel
  • Thanks for your suggestion. Actually, after a bit of search and trial and error, I was able to do it with a one-liner which took just over 2 mins. I will post it. – Insilico May 18 '20 at 08:06
1

With rsync

I have no idea how fast this will be for millions of files but here's a method that uses rsync.

Format your file.list as below (e.g. with `$ cat file.list | awk '{print "+ *" $0}'`).

+ *Name0001
+ *Name0002
...
+ *Name9999

Pass file.list to rsync with the --include-from option:

$ rsync -v -r --dry-run --filter="+ **/" --include-from=/tmp/file.list --filter="- *" /path/to/source/files /path/to/destination/files

Option explanations:

-v                  : Show verbose info.
-r                  : Traverse directories when searching for files to copy.
--dry-run           : Preview only; remove this if the output looks okay.
--filter="+ **/"    : Pattern to include all directories in the search.
--include-from=/tmp/file.list  : Include patterns from the file.
--filter="- *"      : Exclude everything that didn't match the previous patterns.

Option order matters.

Remove --dry-run if the verbose info looks acceptable.

Tested with rsync version 3.1.3.

baltakatei
  • Thank you for the rsync solution. However, this took more than 4 hrs, compared to the perl/bash solutions which took about 2 mins. The advantage here is if I wanted to keep the original file hierarchy: all files are copied to the corresponding source file structure. – Insilico May 18 '20 at 08:28
0

Try locate with grep instead of find. It uses a file index database and thus should be pretty fast. Remember to run sudo updatedb to update the database beforehand.

Touten