I've got to assume some things here, and leave some restrictions
Random regions to pick don't depend in any way on specific lines
Order doesn't matter; there need be N regions spread out through the file
File can be a Gigabyte in size, so can't read it whole (would be much easier!)
There are unhandled (edge or unlikely) cases, discussed after code
First build a sorted list of random numbers; these are positions in the file at which regions start. Then, as each line is read, compute its range of characters in the file, and check whether our numbers fall within it. If some do, they mark the start of each random region: pick substrings of desired length starting at those characters. Check whether substrings fit on the line.
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use List::MoreUtils qw(uniq);
my ($region_len, $num_regions) = (10, 10);
my $count_freq_for = 'F';
#srand(10);
GetOptions(
'num-regions|n=i' => \$num_regions,
'region-len|l=i' => \$region_len,
'char|c=s' => \$count_freq_for,
) or usage();
my $file = shift || usage();
# List of (up to) $num_regions random numbers, spanning the file size
# However, we skip all '>sp' lines so take more numbers (estimate)
open my $fh, '<', $file or die "Can't open $file: $!";
$num_regions += int $num_regions * fraction_skipped($fh);
my @rand = uniq sort { $a <=> $b }
map { int(rand (-s $file)-$region_len) } 1..$num_regions;
say "Starting positions for regions: @rand";
my ($nchars_prev, $nchars, $chars_left) = (0, 0, 0);
my $region;
while (my $line = <$fh>) {
chomp $line;
# Total number of characters so far, up to this line and with this line
$nchars_prev = $nchars;
$nchars += length $line;
next if $line =~ /^\s*>sp/;
# Complete the region if there wasn't enough chars on the previous line
if ($chars_left > 0) {
$region .= substr $line, 0, $chars_left;
my $cnt = () = $region =~ /$count_freq_for/g;
say "$region $cnt";
$chars_left = -1;
};
# Random positions that happen to be on this line
my @pos = grep { $_ > $nchars_prev and $_ < $nchars } @rand;
# say "\tPositions on ($nchars_prev -- $nchars) line: @pos" if @pos;
for (@pos) {
my $pos_in_line = $_ - $nchars_prev;
$region = substr $line, $pos_in_line, $region_len;
# Don't print if there aren't enough chars left on this line
last if ( $chars_left =
($region_len - (length($line) - $pos_in_line)) ) > 0;
my $cnt = () = $region =~ /$count_freq_for/g;
say "$region $cnt";
}
}
sub fraction_skipped {
my ($fh) = @_;
my ($skip_len, $data_len);
my $curr_pos = tell $fh;
seek $fh, 0, 0 if $curr_pos != 0;
while (<$fh>) {
chomp;
if (/^\s*>sp/) { $skip_len += length }
else { $data_len += length }
}
seek $fh, $curr_pos, 0; # leave it as we found it
return $skip_len / ($skip_len+$data_len);
}
sub usage {
say STDERR "Usage: $0 [options] file", "\n\toptions: ...";
exit;
}
Uncomment the srand
line so to have the same run always, for testing.
Notes follow.
Some corner cases
If the 10-long window doesn't fit on the line from its random position it is completed in the next line -- but any (possible) further random positions on this line are left out. So if our random list has 1120 and 1122 while a line ends at 1125 then the window starting at 1122 is skipped. Unlikely, possible, and of no consequence (other than having one region fewer).
When an incomplete region is filled up in the next line (the first if
in the while
loop), it is possible that that line is shorter than the remaining needed characters ($chars_left
). This is very unlikely and needs an additional check there, which is left out.
Random numbers are pruned of dupes. This skews the sequence, but minutely what should not matter here; and we may stay with fewer numbers than asked for, but only by very little
Handling of issues regarding randomness
"Randomness" here is pretty basic, what seems suitable. We also need to consider the following.
Random numbers are drawn over the interval spanning the file size, int(rand -s $file)
(minus the region size). But lines >sp
are skipped and any of our numbers that may fall within those lines won't be used, and so we may end up with fewer regions than the drawn numbers. Those lines are shorter, thus with a lesser chance of having numbers on them and so not many numbers are lost, but in some runs I saw even 3 out of 10 numbers skipped, ending up with a random sample 70% size of desired.
If this is a bother, there are ways to approach it. To not skew the distribution even further they all should involve pre-processing the file.
The code above makes an initial run over the file, to compute the fraction of chars that will be skipped. That is then used to increase the number of random points drawn. This is of course an "average" measure, but which should still produce the number of regions close to desired for large enough files.
More detailed measures would need to see which random points of a (much larger) distribution are going to be lost to skipped lines and then re-sample to account for that. This may still mess with the distribution, what arguably isn't an issue here, but more to the point may simply be unneeded.
In all this you read the big file twice. The extra processing time should only be in the seconds but if this is unacceptable change the function fraction_skipped
to read through only 10-20% of the file. With large files this should still provide a reasonable estimate.
Note on a particular test case
With srand(10)
(commented-out line near the beginning) we get the random numbers such that on one line the region starts 8 characters before the end of the line! So that case does test the code to complete the region on the next line.
Here is a simple driver to run the above a given number of times, for statistics.
Doing it using builtin tools (system
, qx
) is altogether harder and libraries (modules) help. I use IPC::Run here. There are quite a few other options.†
Adjust and add code to process as needed for statistics; output is in files.
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use IPC::Run qw(run);
my $outdir = 'rr_output'; # pick a directory name
mkdir $outdir if not -d $outdir;
my $prog = 'random_regions.pl'; # your name for the program
my $input = 'data_file.txt'; # your name for input file
my $ch = 'F';
my ($runs, $regions, $len) = (10, 10, 10);
GetOptions(
'runs|n=i' => \$runs,
'regions=i' => \$regions,
'length=i' => \$len,
'char=s' => \$ch,
'input=s' => \$input
) or usage();
my @cmd = ( $prog, $input,
'--num-regions', $regions,
'--region-len', $len,
'--char', $ch
);
say "Run: @cmd, $runs times.";
for my $n (1..$runs) {
my $outfile = "$outdir/regions_r$n.txt";
say "Run #$n, output in: $outdir/$outfile";
run \@cmd, '>', $outfile or die "Error with @cmd: $!";
}
sub usage {
say STDERR "Usage: $0 [options]", "\n\toptions: ...";
exit;
}
Please expand on the error checking. See for instance this post and links on details.
Simplest use: driver_random.pl -n 4
, but you can give all of main program's parameters.
The called program (random_regions.pl
above) must be executable.
† Some, from simple to more capable: IPC::System::Simple, Capture::Tiny, IPC::Run3. (Then comes IPC::Run
used here.) Also see String::ShellQuote, to prepare commands without quoting issues, shell injection bugs, and other problems. See links (examples) assembled in this post, for example.