The following code excerpt, when run on perl 5.16.3 and older versions, has a strange behavior, where subsequent calls to a glob in the line input operator causes the glob to continue returning previous values, rather than running the glob anew.

#!/usr/bin/env perl

use strict;
use warnings;

my @dirs = ("/tmp/foo", "/tmp/bar");

foreach my $dir (@dirs) {    
    my $count = 0;
    my $glob = "*";
    print "Processing $glob in $dir\n";
    while (<$dir/$glob>) {
        print "Processing file $_\n";
        $count++;
        last if $count > 0;
    }
}

If I put two files in /tmp/foo and one or more in /tmp/bar, and run the code, I get the following output:

Processing * in /tmp/foo
Processing file /tmp/foo/foo.1
Processing * in /tmp/bar
Processing file /tmp/foo/foo.2

I thought that when the while terminates after the last, the new invocation of the while on the second iteration would re-run the glob and give me the files listed in /tmp/bar, but instead I get a continuation of what's in /tmp/foo.

It's almost like the angle operator glob is acting like a precompiled pattern. My hypothesis is that the angle operator is creating a filehandle in the symbol table that's still open and being reused behind the scenes, and that it's scoped to the containing foreach, or possibly the whole subroutine.

CDahn
  • And before the comments come in, yes, if I remove the angle operator glob and run this with an explicit call to glob() followed by a foreach, I will get the behavior I'm expecting. I'd like to know why this is working like this, not how to fix it. – CDahn Jun 30 '17 at 23:50
  • Re "*where subsequent calls to a glob in the line input operator causes the glob to continue returning previous values*", As it obviously should. It would be useless if `<$dir/$glob>` always returned the first file. – ikegami Jul 01 '17 at 04:14

1 Answer

From I/O Operators in perlop (my emphasis):

A (file)glob evaluates its (embedded) argument only when it is starting a new list. All values must be read before it will start over. In list context, this isn't important because you automatically get them all anyway. However, in scalar context the operator returns the next value each time it's called, or undef when the list has run out.

Since <> is called in scalar context here and you exit the loop with last after the first iteration, the next time you enter the loop it keeps reading from the original list.
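A minimal sketch of that rule, assuming a scratch directory created with File::Temp (the foo/a, foo/b and bar/c names are made up for the demonstration, and the temporary path is assumed to contain no whitespace or glob metacharacters). The first loop leaves early and so resumes the old list; the second reads each list to its end, which lets the pattern be re-evaluated on the next pass.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch tree (hypothetical names): foo/ holds two files, bar/ holds one.
my $root = tempdir(CLEANUP => 1);
for my $path (qw(foo/a foo/b bar/c)) {
    my ($sub) = split m{/}, $path;
    mkdir "$root/$sub" unless -d "$root/$sub";
    open my $fh, '>', "$root/$path" or die "Cannot create $path: $!";
    close $fh;
}
my @dirs = ("$root/foo", "$root/bar");

# Leaving the loop early keeps the unread part of the list alive, so the
# second pass resumes in foo/ and ignores the new value of $dir.
my @early;
for my $dir (@dirs) {
    while (defined(my $file = glob("$dir/*"))) {
        push @early, $file;
        last;
    }
}

# Reading until undef exhausts the list, so the next pass re-evaluates the
# pattern and starts on bar/.
my @drained;
for my $dir (@dirs) {
    my $first;
    while (defined(my $file = glob("$dir/*"))) {
        $first = $file unless defined $first;
    }
    push @drained, $first;
}

print "early exit: @early\n";
print "drained:    @drained\n";
```

Draining only illustrates the documented rule. It still expands every directory in full, so it does nothing for memory use.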


It is clarified in comments that there is a practical need behind this quest: process only some of the files from a directory and never return all filenames since there can be many.

So assigning from glob to a list and working with it, or better yet using for instead of while as commented by ysth, doesn't help here as it returns a huge list.
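For contrast, a sketch of the list-context form, again on a made-up scratch tree (foo/a, foo/b, bar/c are hypothetical, and the temporary path is assumed to be free of whitespace). Here glob is expanded afresh on every pass of the outer loop, so the early last is harmless, but as far as I know the whole list is built before the first file is seen.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch tree (hypothetical names): foo/ holds two files, bar/ holds one.
my $root = tempdir(CLEANUP => 1);
for my $path (qw(foo/a foo/b bar/c)) {
    my ($sub) = split m{/}, $path;
    mkdir "$root/$sub" unless -d "$root/$sub";
    open my $fh, '>', "$root/$path" or die "Cannot create $path: $!";
    close $fh;
}

my @first_of_each;
for my $dir ("$root/foo", "$root/bar") {
    # foreach imposes list context, so glob returns the complete expansion
    # for the current $dir and keeps no leftover state between passes.
    for my $file (glob "$dir/*") {
        push @first_of_each, $file;
        last;
    }
}

print "$_\n" for @first_of_each;
```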

I haven't found a way to make glob (which is what <> with a filename pattern uses) drop and rebuild the list once it has generated it, without reaching its end first. Apparently, each instance of the operator gets its own list, so using another <> inside the while loop in the hope of resetting it, even with the same pattern, doesn't affect the list being iterated over in while (<$dir/$glob>).

Just to note, breaking out of the loop with a die (with while in an eval) doesn't help either; the next time we come to that while the same list is continued. Wrapping it in a closure

sub iter_glob { my $dir = shift; return sub { scalar <"$dir/*"> } }

for my $d (@dirs) {
    my $iter = iter_glob($d);
    while (my $f = $iter->()) {
        # ...
    }
}

met with the same fate; the original list keeps being used.

The solution then is to use readdir instead.
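A rough sketch of the readdir approach, under a few assumptions: take_matching is a hypothetical helper name, the regex argument stands in for whatever the glob pattern was selecting, and the cap of 1000 is taken from the comments below. Each call opens its own directory handle, so leaving the loop early carries no state over to the next directory.

```perl
use strict;
use warnings;

# Hypothetical helper: collect at most $limit entries of $dir whose names
# match $pattern, reading the directory one entry at a time.
sub take_matching {
    my ($dir, $limit, $pattern) = @_;
    my @picked;
    return @picked if $limit < 1;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    while (defined(my $entry = readdir $dh)) {
        next if $entry =~ /^\./;         # like glob("*"): skip ".", ".." and dotfiles
        next unless $entry =~ $pattern;  # stands in for the glob pattern
        push @picked, "$dir/$entry";
        last if @picked >= $limit;       # early exit is safe: $dh belongs to this call
    }
    closedir $dh;
    return @picked;
}

# Directories come from the command line; nothing happens without arguments.
for my $dir (@ARGV) {
    print "Processing file $_\n" for take_matching($dir, 1000, qr/./);
}
```

Unlike glob, readdir returns entries in whatever order the filesystem provides, so the files picked are not necessarily the alphabetically first ones.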

zdim
  • explicit glob doesn't help at all; easiest fix is to use `for` instead of `while` – ysth Jul 01 '17 at 00:16
  • Okay, so the line operator is indeed creating some global reference that is being reused on the next foreach iteration. I also tried putting this code in a subroutine to force the scope to exit, and the glob is keeping state across repeated invocations of the subroutine. I guess the lesson is to just not do this. – CDahn Jul 01 '17 at 00:31
  • @ysth Indeed, as I indicated above, that's the fix for this. That being said, the reason the code didn't do this in the first place is due to the potentially large number of files the glob could return. The dev was trying to be efficient and not slurp up a lot of memory when only a limited, maximum, number would be processed anyways. But, looks like that strategy isn't going to work. – CDahn Jul 01 '17 at 00:32
  • @ysth By saying `glob` I meant in list context, whereby it will re-build it every time (and then iterate w/ `foreach`). But as you say `foreach` is a far simpler way for that (unless it has an "optimization" to not return the full list, like it does for ranges). – zdim Jul 01 '17 at 03:26
  • @CDahn I didn't think that this had a practical issue behind it, and didn't even think about what to do about it. So you'd need to be able to only process some of what is in directories (and without ever pulling the whole content)? Thank you for the attribution. – zdim Jul 01 '17 at 03:28
  • @CDahn I tried exiting the `while` early with a `die` (with the `while` loop in `eval`), hoping that it might "reset" it but it still returned to printing content of the first directory ... – zdim Jul 01 '17 at 04:30
  • @zdim Yeah, the original code is processing files in a series of directories, and caps each at 1000 so that no one directory can starve the others. Apparently this safety wasn't ever needed, as it turned out, until just recently. Mass confusion ensued. I'm either going to go to readdir with a grep to replace the glob, or see if a closure with the glob will do the same without needing readdir. – CDahn Jul 01 '17 at 22:45
  • @zdim Just for completeness sake, I also tried creating a class that would contain the glob so that I could explicitly allocate new memory and try to force a new glob to be invoked, but perl is too smart for that. I get the same behavior. It looks like the only way to cap the output efficiently is going to be to move over to readdir with grep. – CDahn Jul 01 '17 at 23:00
  • @CDahn OK, so you really did a lot with this! I tried a closure, no dice; it _still_ keeps reading the same list. I may have not tried "hard enough" ... but `readdir` is clearer, should be more memory efficient, and it works as needed since it iterates through files like `<>` does with lines. Just quit it once you hit your limit and it reads the next directory from `foreach`. (I had tested that). I don't see why `grep` is needed? – zdim Jul 02 '17 at 04:31
  • @zdim the `grep` is needed to mimic what the glob was doing, otherwise you get unwanted files out of `readdir`. – CDahn Jul 02 '17 at 13:09
  • Putting the glob in a closure will work if done correctly, what did you try? – ysth Jul 02 '17 at 17:12
  • @ysth I may need to check that, but: `sub iter_glob { my $dir = shift; return sub { scalar <"$dir/*"> } }; for my $d (@dirs) { my $iter = iter_glob($d); while (my $f = $iter->()) {say $f; last} }` -- doesn't, it keeps printing from the same list (the first of `@dirs`) – zdim Jul 02 '17 at 19:18
  • @CDahn I had tested this: `for my $d (@dirs) { opendir my $dh, $d; while (my $entry = readdir($dh)) {next if $entry =~ /^\./; say "$d/$entry"; last} }` and it works as needed. You can add selection rules, to skip items you don't want to process. I don't know how you mean to feed it to `grep` but it imposes list context so if it's `grep { ..} readdir` then `readdir` will return the whole list. – zdim Jul 02 '17 at 19:23
  • @CDahn If you have v5.12 or later (I think) then `readdir` in a `while` loop _does_ set `$_`. This was on v5.10 for which it wasn't doing that yet so I had to assign to a lexical. So on a modern Perl you can do `while (readdir($dh)) { next if /^\./; say "$d/$_"; }`. If you wish, I mean -- there's nothing wrong with using a lexical, to say the least :) – zdim Jul 02 '17 at 21:01
  • Sorry, I think I was confusing .. with glob; closures work with that but not glob – ysth Jul 02 '17 at 21:11