2

I saw this link using glob

It's not quite what I want to do though.

Here is my plan: search through a directory, given to my function as a parameter (say /home/username/sampledata), for any files whose names partially match a string (say data).

The user can pass a flag at execution to control whether subdirectories are checked; by default the script does not include them.

The pseudocode for the one that does include the subdirectories would look like this.

The array that I am saving the file paths to is global

  @fpaths;

  foo($dir);

  sub foo{
      get a tmp array of all files

      for ($i=0 ; $i<@tmp ; $i++) {
          next if ( $tmp[$i] is a hidden file and !$hidden) ; #hidden is a flag too

          if($tmp[$i] is file) {
               push (@fpaths, $dir."/".$tmp[$i]);
          }
          if($tmp[$i] is dir) {
               foo($dir."/".$tmp[$i]);
          }

       }
   }

That looks pretty solid.

What I'm hoping to achieve is an array of every file with the full path name saved.

The part I do not know how to do is get the list of every file. Hopefully this can be done with glob.

I have been able to use opendir/readdir to read every file and I could do this again if I knew how to check if the result is a file or a directory.

So my questions are:

  1. How do I use glob with a path name to get an array of every file/subdirectory?

  2. How do I check whether an item in the resulting array is a directory or a file?

Thanks everybody

Chris Topher
  • `for ($i=0 ; $i<@tmp ; $i++) { ... }` is conventionally written `for my $i (0 .. $#tmp) { ... }` – Borodin May 23 '13 at 00:38

5 Answers

9

I would use File::Find.

Note that $File::Find::name is the complete path to the given file. That includes directories, since they are also files.

This is just a sample for the reader to figure out the rest of the details.

use warnings;
use strict;
use File::Find;

my $path = "/home/cblack/tests";

find(\&wanted, $path);

sub wanted {
   return if ! -e; 

   print "$File::Find::name\n" if $File::Find::name =~ /foo/;
   print "$File::Find::dir\n" if $File::Find::dir =~ /foo/;
}

Better yet, if you want to push all these to a list you can do it like so:

use File::Find;

main();

sub main {
    my $path = "/home/cblack/Misc/Tests";
    my $dirs = [];
    my $files= [];
    my $wanted = sub { _wanted($dirs, $files) };

    find($wanted, $path);
    print "files: @$files\n";
    print "dirs: @$dirs\n";
}

sub _wanted {
   return if ! -e; 
   my ($dirs, $files) = @_;

   push( @$files, $File::Find::name ) if $File::Find::name=~ /foo/;
   push( @$dirs, $File::Find::dir ) if $File::Find::dir =~ /foo/;
}
chrsblck
  • I don't quite understand how you are navigating through a directory. There will likely be a few hundred files and a dozen unknown subdirectories. If any file in the directory or its subdirectories has a name containing a certain string, I want its entire path saved to an array. – Chris Topher May 23 '13 at 03:13
  • @ChrisTopher I linked `File::Find` in the first line of my post. You should read it. In fact, the first line in the Module's Description reads: "These are functions for searching through directory trees doing work on each file found similar to the Unix find command." And then goes on to describe the function in my post, `find`. Why don't you create a small sandbox for yourself to play around in? – chrsblck May 23 '13 at 03:43
  • Will experiment, Thank you – Chris Topher May 23 '13 at 04:46
  • @ChrisTopher Here's an [example](http://stackoverflow.com/questions/16671127/perl-directory-walking-issue-cant-go-back-up-more-than-1-directory-properly/16672622#16672622) doing this recursively without `File::Find`. In case you're curious. – chrsblck May 23 '13 at 05:14
3
  • I don't see why glob solves your problem of how to check whether a directory entry is a file or a directory. If you've been using readdir before then stick with it

  • Don't forget you have to handle links carefully, otherwise your recursion may never end

  • Also remember that readdir returns . and .. as well as the real directory contents

  • Use -f and -d to check whether a node name is a file or a directory, but remember that if its location isn't your current working directory then you have to fully qualify it by adding the path, otherwise you'll be talking about a completely different node that probably doesn't exist

  • Unless this is a learning experience, you are much better off using something ready-rolled and tested, like File::Find
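
The points above can be put together in a minimal sketch, intended as a learning aid rather than a replacement for File::Find. The sub name find_matching and the plain substring test on the file name are assumptions made for illustration. It skips . and .., qualifies each name with its directory before testing it, and does not follow symlinks:

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the paths of all plain files under $dir
# whose name contains the substring $needle.
sub find_matching {
    my ($dir, $needle) = @_;
    my @found;

    opendir(my $dh, $dir) or die "Can't open $dir: $!";
    my @entries = readdir $dh;
    closedir $dh;

    for my $name (sort @entries) {
        # readdir returns '.' and '..' as well as the real contents.
        next if $name eq '.' or $name eq '..';

        # Qualify the name with its directory before using -l, -d or -f.
        my $path = File::Spec->catfile($dir, $name);

        # Skip symlinks so that the recursion cannot loop.
        next if -l $path;

        if (-d $path) {
            push @found, find_matching($path, $needle);
        }
        elsif (-f $path and index($name, $needle) >= 0) {
            push @found, $path;
        }
    }
    return @found;
}

# Usage: perl script.pl DIRECTORY STRING
if (@ARGV == 2) {
    print "$_\n" for find_matching(@ARGV);
}
```

Returning the list from the sub, rather than pushing onto a global array, keeps each recursive call self-contained.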

Borodin
  • Yes, I am checking for hidden files first; they will need to be excluded in the subdirectory check as well. If I had a fully qualified name of a file, how do I do a -d/-f check on it? Would it be like: if ($path -d) print this is a directory? – Chris Topher May 23 '13 at 03:05
  • @ChrisTopher: [Links are nothing to do with hidden files.](http://www.cyberciti.biz/tips/understanding-unixlinux-symbolic-soft-and-hard-links.html) Read the reference I linked to in my answer, or the reference you linked to in your question, to see how to use `-f` and `-d`. – Borodin May 23 '13 at 12:15
  • That's exactly what I was looking for, I couldn't tell they were links, Hah, my recursion works, just not adding to the array properly. Thanks a bunch. Just started learning Perl on Tuesday, cool scripting language. So I appreciate some direction – Chris Topher May 23 '13 at 17:48
3

Inspired by Nima Soroush's answer, here's a generalized recursive globbing function similar to Bash 4's globstar option that allows matching across all levels of a subtree with **.

Examples:

# Match all *.txt and *.bak files located anywhere in the current
# directory's subtree.
globex '**/{*.txt,*.bak}' 

# Find all *.pm files anywhere in the subtrees of the directories in the
# module search path, @INC; follow symlinks.
globex '{' . (join ',', @INC) . '}/**/*.pm', { follow => 1 }

Note: While this function, which combines File::Find with the built-in glob function, probably mostly works as you expect if you're familiar with glob's behavior, there are many subtleties around sorting and symlink behavior - please see the comments at the bottom.

A notable deviation from glob() is that whitespace in a given pattern argument is considered part of the pattern; to specify multiple patterns, pass them as separate pattern arguments or use a brace expression, as in the example above.
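
To illustrate the whitespace difference, here is a small sketch comparing the built-in glob() with File::Glob's bsd_glob(), which globex uses internally. The file name 'my notes.txt' is hypothetical and does not need to exist:

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);

# The built-in glob() splits its argument on whitespace,
# so this string is treated as two separate patterns.
my @core = glob('my notes.txt');

# bsd_glob() treats the whole string, spaces included, as one pattern.
my @bsd = bsd_glob('my notes.txt');

printf "glob:     %d item(s): %s\n", scalar @core, join(' | ', @core);
printf "bsd_glob: %d item(s): %s\n", scalar @bsd,  join(' | ', @bsd);
```

Since neither pattern contains wildcard characters, both functions return their patterns as literal strings, which makes the difference in item count easy to see.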

Source code

sub globex {

  use File::Find;
  use File::Spec;
  use File::Basename;
  use File::Glob qw/bsd_glob GLOB_BRACE GLOB_NOMAGIC GLOB_QUOTE GLOB_TILDE GLOB_ALPHASORT/;

  my @patterns = @_;
  # Set the flags to use with bsd_glob() to emulate default glob() behavior.
  my $globflags = GLOB_BRACE | GLOB_NOMAGIC | GLOB_QUOTE | GLOB_TILDE | GLOB_ALPHASORT;
  my $followsymlinks;
  my $includehiddendirs;
  if (ref($patterns[-1]) eq 'HASH') {
    my $opthash = pop @patterns;
    $followsymlinks = $opthash->{follow};
    $includehiddendirs = $opthash->{hiddendirs};
  }
  unless (@patterns) { return };

  my @matches;
  my $ensuredot;
  my $removedot;
  # Use fc(), the casefolding function for case-insensitive comparison, if available.
  my $cmpfunc = defined &CORE::fc ? \&CORE::fc : \&CORE::lc;

  for (@patterns) {
    my ($startdir, $anywhereglob) = split '(?:^|/)\*\*(?:/|$)';
    if (defined $anywhereglob) {  # recursive glob
      if ($startdir) {
        $ensuredot = 1 if m'\./'; # if pattern starts with '.', ensure it is prepended to all results
      } elsif (m'^/') { # pattern starts with root dir, '/'
        $startdir = '/';
      } else { # pattern starts with '**'; must start recursion with '.', but remove it from results
        $removedot = 1;
        $startdir = '.';
      }
      unless ($anywhereglob) { $anywhereglob = '*'; }
      my $terminator = m'/$' ? '/' : '';
      # Apply glob() to the start dir. as well, as it may be a pattern itself.
      my @startdirs = bsd_glob $startdir, $globflags or next;
      find({
          wanted => sub {
            # Ignore symlinks, unless told otherwise.
            unless ($followsymlinks) { -l $File::Find::name and return; }
            # Ignore non-directories and '..'; we only operate on 
            # subdirectories, where we do our own globbing.
            ($_ ne '..' and -d) or return;
            # Skip hidden dirs., unless told otherwise.
            unless ($includehiddendirs) {  return if basename($_) =~ m'^\..'; }
            my $globraw;
            # Glob without './', if it wasn't part of the input pattern.
            if ($removedot and m'^\./(.+)$') { 
              $_ = $1;
            }
            $globraw = File::Spec->catfile($_, $anywhereglob);
            # Ensure a './' prefix, if the input pattern had it.
            # Note that File::Spec->catfile() removes it.
            if($ensuredot) {
              $globraw = './' . $globraw if $globraw !~ m'\./';
            }
            push @matches, bsd_glob $globraw . $terminator, $globflags;
          },
          no_chdir => 1,
          follow_fast => $followsymlinks, follow_skip => 2,
          # Pre-sort the items case-insensitively so that subdirs. are processed in sort order.
          # NOTE: Unfortunately, the preprocess sub is only called if follow_fast (or follow) are FALSE.
          preprocess => sub { return sort { &$cmpfunc($a) cmp &$cmpfunc($b) } @_; }
        }, 
        @startdirs);
    } else {  # simple glob
      push @matches, bsd_glob($_, $globflags);
    }
  }
  return @matches;
}

Comments

SYNOPSIS
  globex PATTERNLIST[, \%options]

DESCRIPTION
  Extends the standard glob() function with support for recursive globbing.
  Prepend '**/' to the part of the pattern that should match anywhere in the
  subtree or end the pattern with '**' to match all files and dirs. in the
  subtree, similar to Bash's `globstar` option.

  A pattern that doesn't contain '**' is passed to the regular glob()
  function.
  While you can use brace expressions such as {a,b}, using '**' INSIDE
  such an expression is NOT supported, and will be treated as just '*'.
  Unlike with glob(), whitespace in a pattern is considered part of that
  pattern; use separate pattern arguments or a brace expression to specify
  multiple patterns.

  To also follow directory symlinks, set 'follow' to 1 in the options hash
  passed as the optional last argument.
  Note that this changes the sort order - see below.

  Traversal:
  For recursive patterns, any given directory examined will have its matches
  listed first, before descending depth-first into the subdirectories.

  Hidden directories:
  These are skipped by default, unless you set 'hiddendirs' to 1 in the
  options hash passed as the optional last argument.

  Sorting:
  A given directory's matching items will always be sorted
  case-insensitively, as with glob(), but sorting across directories
  is only ensured, if the option to follow symlinks is NOT specified.

  Duplicates:
  Following symlinks only prevents cycles, so if a symlink and its target
  are both encountered, they will both be reported.
  (Under the hood, following symlinks activates the following 
   File::Find::find() options: `follow_fast`, with `follow_skip` set to 2.)

  Since the default glob() function is at the heart of this function, its
  rules - and quirks - apply here too:
  - If literal components of your patterns contain pattern metacharacters,
    - * ? { } [ ] - you must make sure that they're \-escaped to be treated
    as literals; here's an expression that works on both Unix and Windows
    systems: s/[][{}\-~*?]/\\$&/gr
  - Unlike with glob(), however, whitespace in a pattern is considered part
    of the pattern; to specify multiple patterns, use either a brace
    expression (e.g., '{*.txt,*.md}'), or pass each pattern as a separate
    argument.
  - A pattern ending in '/' restricts matches to directories and symlinks
    to directories, but, strangely, also includes symlinks to *files*.
  - Hidden files and directories are NOT matched by default; use a separate
    pattern starting with '.' to include them; e.g., globex '**/{.*,*}'
    matches all files and directories, including hidden ones, in the 
    current dir.'s subtree.
    Note: As with glob(), .* also matches '.' and '..'
  - Tilde expansion is supported; escape as '\~' to treat a tilde that is
    the first char. as a literal.
  - A literal path (with no pattern chars. at all) is echoed as-is, 
    even if it doesn't refer to an existing filesystem item.

COMPATIBILITY NOTES
  Requires Perl v5.6.0+
  '/' must be used as the path separator on all platforms, even on Windows.

EXAMPLES
  # Find all *.txt files in the subtree of a dir stored in $mydir, including
  # in hidden subdirs.
  globex "$mydir/**/*.txt", { hiddendirs => 1 };

  # Find all *.txt and *.bak files in the current subtree.
  globex '**/*.txt', '**/*.bak'; 

  # Ditto, though with different output ordering:
  # Unlike above, where you get all *.txt files across all subdirs. first,
  # then all *.bak files, here you'll get *.txt files, then *.bak files
  # per subdirectory encountered.
  globex '**/{*.txt,*.bak}';

  # Find all *.pm files anywhere in the subtrees of the directories in the
  # module search path, @INC; follow symlinks.
  # Note: The assumption is that no directory in @INC has embedded spaces
  #       or contains pattern metacharacters.
  globex '{' . (join ',', @INC) . '}/**/*.pm', { follow => 1 };
mklement0
  • Why not throw this on CPAN? I can't find a similar module – mikew Aug 29 '15 at 06:33
  • Thanks for the suggestion, @mikew. Making this CPAN-ready requires quite a bit more work and thus time, which I don't have right now, but I'll keep it in mind. – mklement0 Aug 30 '15 at 02:31
1

You can use this method as a recursive file search that selects specific file types:

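use File::Find;

# $outputDir is assumed to be set by the caller.
my @files;
push @files, list_dir($outputDir);

sub list_dir {
        my @dirs = @_;
        my @files;
        # Quote the pattern so that glob() doesn't split paths containing whitespace.
        find({ wanted => sub { push @files, glob "\"$_/*.txt\"" } , no_chdir => 1 }, @dirs);
        return @files;
}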
Nima Soroush
  • Kudos for a clever combination of `File::Find` and `glob`. Two subtleties: you're globbing on _every_ invocation of the `wanted` subroutine, which means you're needlessly invoking `glob()` on _files_ too, not just directories. Conversely, if the item at hand is a _symlink_ to a directory, you're globbing its contents as well (but not recursively), which may be unexpected, given that you're not configuring `find` to follow symlinks. If you don't need to follow symlinks, you could do your globbing in the `preprocess` subroutine instead, and return an empty list from it. – mklement0 Aug 26 '15 at 06:00
  • @mklement0 : Thanks for the subtleties. You are absolutely right about it – Nima Soroush Aug 26 '15 at 13:55
0

I tried implementing this by using only readdir. I leave my code here in case it is useful to anyone:

sub rlist_files{
    my @depth = ($_[0],);
    my @files;
    while ($#depth > -1){
        my $dir = pop(@depth);
        opendir(my $dh, $dir) || die "Can't open $dir: $!";
        while (readdir $dh){
            my $entry = "$dir/$_";
            if (!($entry =~ /\/\.+$/)){
                if (-f $entry){
                    push(@files,$entry);
                }
                elsif (-d $entry){
                    push(@depth, $entry);
                }
            }
        }
        closedir $dh;
    }
    return @files;
}

EDIT: as nicely indicated by @brian d foy, that code is not taking into account symlinks at all.

As an exercise I have tried to write a new sub capable of optionally following symlinks recursively without falling into loops and with somewhat limited memory use (using hashes to keep track of the visited symlinks was using several GB in large runs). While I was at it, I also added the option of passing a regex to filter files. Again, I leave my code here in case it is useful to anyone:

sub rlist_files_nohash{
    use Cwd qw(abs_path);
    my $input_path = abs_path($_[0]);
    if (!defined $input_path){
        die "Cannot find $_[0]."
    }
    my $ignore_symlinks = 0;
    if ($#_>=1){
        $ignore_symlinks = $_[1];
    }
    my $regex;
    if ($#_==2){
        $regex = $_[2];
    }   
    my @depth = ($input_path,);
    my @files;
    my @link_dirs;
    while ($#depth > -1){
        my $dir = pop(@depth);
        opendir(my $dh, $dir) or die "Can't open $dir: $!";
        while (readdir $dh){
            my $entry = "$dir/$_";
            if (!($entry =~ /\/\.+$/)){
                if (-l $entry){
                    if ($ignore_symlinks){
                        $entry = undef;
                    }
                    else{
                        while (defined $entry && -l $entry){
                            $entry = readlink($entry);
                            if (defined $entry){
                                if (substr($entry, 0, 1) ne "/"){
                                    $entry = $dir."/".$entry;
                                }
                                $entry = abs_path($entry);
                            }
                        }
                        if (defined $entry && -d $entry){
                            if ($input_path eq substr($entry,0,length($input_path))){
                                $entry = undef;
                            }
                            else{
                                for (my $i = $#link_dirs;($i >= 0 && defined $entry); $i--){
                                    if (length($link_dirs[$i]) <= length($entry) && $link_dirs[$i] eq substr($entry,0,length($link_dirs[$i]))){
                                        $entry = undef;
                                        $i = $#link_dirs +1;
                                    }
                                }
                                if(defined $entry){
                                    push(@link_dirs, $entry);
                                }
                            }
                        }
                    }
                }
                if (defined $entry){
                    if (-f $entry && (!defined $regex || $entry =~ /$regex/)){
                        push(@files, abs_path($entry));
                    }
                    elsif (-d $entry){
                        push(@depth, abs_path($entry));
                    }
                }
            }
        }
        closedir $dh;
    }
    if ($ignore_symlinks == 0){
        @files = sort @files;
        my @indices = (0,);
        for (my $i = 1;$i <= $#files; $i++){
            if ($files[$i] ne $files[$i-1]){
                push(@indices, $i);
            }
        }
        @files = @files[@indices];
    }
    return @files;
}
#Testing
my $t0 = time();
my @files = rlist_files_nohash("/home/user/", 0, qr/\.pdf$/);
my $tf = time() - $t0;
for my $file (@files){
    print($file."\n");
}
print ("Total files found: ".scalar @files."\n");
print ("Execution time: $tf\n");
IMA
  • Remember to check for symlinks, otherwise you might end up searching the same directory again. Trust me, I know :) – brian d foy Apr 18 '21 at 05:58
  • Thank you! You are absolutely right, that code neglects symlinks altogether. Trying to follow symlinks recursively avoiding loops has been a good learning experience. I will edit the answer accordingly. – IMA Apr 19 '21 at 09:52