9

I am running the following expecting return strings of 5 characters:

while (glob '{a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z}'x5) {
  print "$_\n";
}

but it returns only 4 characters:

anbc
anbd
anbe
anbf
anbg
...

However, when I reduce the number of characters in the list:

while (glob '{a,b,c,d,e,f,g,h,i,j,k,l,m}'x5) {
  print "$_\n";
}

it returns correctly:

aamid
aamie
aamif
aamig
aamih
...

Can someone please tell me what I am missing here, is there a limit of some sort? or is there a way around this?

If it makes any difference, It returns the same result in both perl 5.26 and perl 5.28

Gerry
  • 207
  • 2
  • 14
  • Previously: https://stackoverflow.com/a/58852104 https://stackoverflow.com/a/58853045 Use a module providing an iterator instead instead of abusing the glob function. http://p3rl.org/Algorithm::Combinatorics http://p3rl.org/Algorithm::Loops – daxim Nov 16 '19 at 07:50
  • Thanks @daxim. The problem is I am struggling to load modules of any sort right now, I have a cpan issue complaining about Win32::Console, yet ppm is not available in perl 5.28 either so I can load the module for cpan to stop complaining. – Gerry Nov 16 '19 at 08:25
  • Thanks @zdim appreciate all the time and effort. – Gerry Nov 16 '19 at 08:30
  • I just realized ... do you want this shuffled (randomized) at all, or just the complete list? – zdim Nov 16 '19 at 08:55
  • @zdim just a complete list. :) – Gerry Nov 16 '19 at 08:56

2 Answers2

7

The glob first creates all possible file name expansions, so it will first generate the complete list from the shell-style glob/pattern it is given. Only then will it iterate over it, if used in scalar context. That's why it's so hard (impossible?) to escape the iterator without exhausting it; see this post.

In your first example that's 265 strings (11_881_376), each five chars long. So a list of ~12 million strings, with (naive) total in excess of 56Mb ... plus the overhead for a scalar, which I think at minimum is 12 bytes or such. So at the order of a 100Mb's, at the very least, right there in one list.

I am not aware of any formal limits on lengths of things in Perl (other than in regex) but glob does all that internally and there must be undocumented limits -- perhaps some buffers are overrun somewhere, internally? It is a bit excessive.

As for a way around this -- generate that list of 5-char strings iteratively, instead of letting glob roll its magic behind the scenes. Then it absolutely should not have a problem.

However, I find the whole thing a bit big for comfort, even in that case. I'd really recommend to write an algorithm that generates and provides one list element at a time (an "iterator"), and work with that.

There are good libraries that can do that (and a lot more), some of which are Algorithm::Loops recommended in a previous post on this matter (and in a comment), Algorithm::Combinatorics (same comment), Set::CrossProduct from another answer here ...

Also note that, while this is a clever use of glob, the library is meant to work with files. Apart from misusing it in principle, I think that it will check each of (the ~ 12 million) names for a valid entry! (See this page.) That's a lot of unneeded disk work. (And if you were to use "globs" like * or ? on some systems it returns a list with only strings that actually have files, so you'd quietly get different results.)


 I'm getting 56 bytes for a size of a 5-char scalar. While that is for a declared variable, which may take a little more than an anonymous scalar, in the test program with length-4 strings the actual total size is indeed a good order of magnitude larger than the naively computed one. So the real thing may well be on the order of 1Gb, in one operation.

Update   A simple test program that generates that list of 5-char long strings (using the same glob approach) ran for 15-ish minutes on a server-class machine and took 725 Mb of memory.

It did produce the right number of actual 5-char long strings, seemingly correct, on this server.

zdim
  • 64,580
  • 5
  • 52
  • 81
  • @Gerry First, I am not sure that the problem is with limits; looking into it... Perhaps generate the list first, iteratively (not all at once), and store it in a proper array? That surely won't get anywhere near any limits, a "handful" of 5-char strings. (It's also diagnostic --- if that works then it's indeed some internal limit.) – zdim Nov 16 '19 at 06:48
  • @Gerry Don't need modules --- just build the list (of five-char strings) into an array first, piece by piece, instead of lumping it together using `glob`. (That will need some simple-minded, other algorithm. Perhaps what I posted in your previous question? That's good debugging -- if you can get that list without issues then you know that limits are being pushed here.) I added some size estimates that I'm getting to the post... – zdim Nov 16 '19 at 07:01
  • @Gerry `time perl -MDevel::Size=total_size -wE'$chs = join ",", "a".."z"; @items = glob "{$chs}"x5; say STDERR "Total memory: ", total_size(\@items)/(1024**2), " Mb"` ... and let me check ... now it ran in 30 seconds, what confirms it given how the cacheing works here. I also checked RSS with external tools while it was going. – zdim Nov 16 '19 at 07:31
  • @Gerry Same behavior on v5.29.2 (~600Mb now) ... still riding on that cache on this server :))) – zdim Nov 16 '19 at 08:15
  • @Gerry Result from another server class machine, with v5.16 -- 28 minutes (underestimated while it was going!) and 750Mb. Now reran under 5.29.2 and again ~600Mb. Correct strings, and correct number of them (exactly `26**5`) – zdim Nov 16 '19 at 08:18
  • Hold the phone... Seems like this is a perl windows issue or windows caching itself. I just ran it on my Kali Linux VM, on the same laptop and it works 100% it loads 5 char strings. – Gerry Nov 16 '19 at 08:23
  • Thanks. Yes we can keep these important comments I think. – Gerry Nov 16 '19 at 08:59
  • The list couldn't be on the stack, since it needs to survive between calls to `glob`. (And a stack overflow would kill the program.) – ikegami Nov 16 '19 at 20:06
  • @ikegami yeah, but I didn't know how to call it -- by "stack" I meant some internal storage from which other tools can fetch the numbers. In this example there's a `while` that's getting its hands on the numbers, right as they are produced ... Is it just an anon storage on the heap, or in some protected areas ...? – zdim Nov 16 '19 at 22:26
  • Re "*there's a `while` that's getting its hands on the numbers, right as they are produced*", The whole list is produced up front. Only after the list is produced does `glob` return the first string (which is assigned to `$_`). Re "*Is it just an anon storage on the heap, or in some protected areas*", It's dynamically allocated, so it's on the heap. The list is placed in the opcode structure of the currently executing `glob` opcode (allowing a second `glob` to be used when iterating over the result of another `glob`). – ikegami Nov 16 '19 at 23:30
  • @ikegami OK, thank you -- that also explains things in the linked post :). What is that "_opcode structure_" ... on heap or free-store of some kind? Or some special memory segment for similar purposes? I'm thinking whether it's feasible to expect that some internal structure gets overrun ... – zdim Nov 16 '19 at 23:41
  • It's dynamically allocated, so it's on the heap. – ikegami Nov 16 '19 at 23:42
  • @ikegami OK... then I don't know what can fail in that process. They did find out that it fails on Windows and not on a Linux (a few comments up), but still ... what fails? Thank you --- editing, removed the term "stack" – zdim Nov 16 '19 at 23:47
  • I have no idea how or why it degenerates to 4 characters either, but I can reproduce it (Windows) – ikegami Nov 16 '19 at 23:48
  • @ikegami Ah! That's a revelation ... so there is definitely something going on (not just on OP's system I mean). These are extreme sizes but still, it apparently doesn't happen on Linux – zdim Nov 17 '19 at 00:01
  • @ikegami Thanks for confirming this on Windows. I actually tried on on my Windows 8.1 system with Perl 5.26 and on my Windows 10 devices with Perl 5.28 and the same result on both. On my Kali Linux VM it did in fact produce 5 characters as expected. One thing however I noticed is that my CPU and memory never really reached 90% on the Windows devices, but on the Linux system it would run average of 99.7 percent cpu with growing memory usage. The minute I would increment to 6 chars (for test purpose) it would flat line everything and eventually the terminal just says _"Killed"_ – Gerry Nov 17 '19 at 05:17
  • Thanks @ikegami and Zdim for your valuable input, I really appreciate the time and effort. – Gerry Nov 17 '19 at 05:18
6

Everything has some limitation.

Here's a pure Perl module that can do it for you iteratively. It doesn't generate the entire list at once and you start to get results immediately:

use v5.10;

use Set::CrossProduct;

my $set = Set::CrossProduct->new( [ ([ 'a'..'z' ]) x 5 ] );

while( my $item = $set->get ) {
    say join '', @$item
    }
brian d foy
  • 129,424
  • 31
  • 207
  • 592
  • Man, You do not understand how happy I am right now. Thank you very much!! – Gerry Nov 16 '19 at 17:49
  • 3
    Algorithm::Loops's `NestedLoops` could also be used: `use Algorithm::Loops qw( NestedLoops ); NestedLoops([ ([ 'a'..'z' ]) x 5 ], sub { say join '', @_ } );` (An answer to an earlier question by the OP mentioned they could use this if they were running out of memory...) – ikegami Nov 16 '19 at 20:09