
I have data like this:

>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
VECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMECQGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDIIIVLDQKDHAVFTRRGEDLFMCMDIQLVEALCGFQKPISTLDNRTIVITSHPGQIVKHGDIKCVLNEGMPIYRRPYEKGRLIIEFKVNFPENGFLSPDKLSLLEKLLPERKEVEE
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
MTEQMTLRGTLKGHNGWVTQIATTPQFPDMILSASRDKTIIMWKLTRDETNYGIPQRALRGHSHFVSDVVISSDGQFALSGSWDGTLRLWDLTTGTTTRRFVGHTKDVLSVAFSSDNRQIVSGSRDKTIKLWNTLGVCKYTVQDESHSEWVSCVRFSPNSSNPIIVSCGWDKLVKVWNLANCKLK
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVLTAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRTRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCES
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTFHGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGFTNVNFGRSRSAQEPARKKQDPPVTHDLRVSLEEIYSGCTKKMKISHK
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKPHDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTIRFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHACFSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQLNPDVVAPLPTHTAHEDFTFGESRPELGSEGLGSAAHGSQPDLRRIVELPAPADFLSLSSETKPKLMTPDAFMTPSASLQQITASPSSSSSGSSSSSSSSSSSLTAVSAMSSTSAVDPSLTRPPEELTLSPKLQLDGSLTMSSSGSLQASPRGLLPGLLPAPADKLTPKGPGQVPTATSALSLELQEVEP
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGETPGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVATIPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIESTPELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGREVENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELEEELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRWTEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVK

I want to randomly pick a region of 10 letters from it and then count the number of F's in it. I want to do that a certain number of times, for example 1000 times or even more.

As an example, I randomly pick

LVPSLTRYLT    0

then

ITNLRSFIHK    1

then again I randomly pick 10 consecutive letters

AHSRIRKERP    0

This continues until it reaches the requested number of runs. I want to store all the randomly selected regions with their counts, because I then want to calculate how many times F is seen.

So I do the following

# first I remove the header 
grep -v ">" data.txt > out.txt

Then, to randomly get one region of 10 letters, I tried to use shuf, with no success:

shuf -n1000 data.txt 

Then I tried to use awk and was not successful either:

awk 'BEGIN {srand()} !/^$/ { if (rand() == 10) print $0}'

Then calculate the number of F's and save it to a file:

grep -i -e [F] |wc -l 

Note, we should not pick up the same region twice

Learner
  • What do the numbers next to the 10-letter-long words mean? Also, how big is your file? – karakfa Feb 12 '19 at 22:40
  • @karakfa 0 means there is no F, 1 means there is 1, 2 means there are 2 F's in that selection, etc. The file is rather huge :-) I know that `shuf` uses a lot of memory, but I still could not get it to work :-D – Learner Feb 12 '19 at 22:42
  • (1) "_rather huge_" -- how big is your file? "huge" means different things to different people :) (2) Do lines `>sp[..]` count? Or do you mean to pick random regions only out of lines with data? – zdim Feb 12 '19 at 22:53
  • @zdim 10000000 lines; random from any line, but consecutive (for example, we cannot pick one letter from each place; it should get a region of 10 consecutive letters), for a number of times – Learner Feb 12 '19 at 23:00
  • Clarify `Note, we should not pick up the same region twice`. If we pick the region of chars at positions `5-15` - can we later pick the region `10-20` or is that considered picking up the same region twice since they overlap at `10-15`? – Ed Morton Feb 13 '19 at 16:07
  • @Ed Morton yes, we can pick regions that overlap, but not the same region twice – Learner Feb 13 '19 at 16:28

4 Answers


I have to assume some things here, and leave some restrictions:

  • Random regions to pick don't depend in any way on specific lines

  • Order doesn't matter; there need to be N regions spread out through the file

  • The file can be a gigabyte in size, so we can't read it whole (that would be much easier!)

  • There are unhandled (edge or unlikely) cases, discussed after code

First, build a sorted list of random numbers; these are the positions in the file at which regions start. Then, as each line is read, compute its range of characters in the file and check whether any of our numbers fall within it. If some do, they mark the start of a random region: pick substrings of the desired length starting at those characters, and check whether each substring fits on the line.

use warnings;
use strict;
use feature 'say';

use Getopt::Long;
use List::MoreUtils qw(uniq);

my ($region_len, $num_regions) = (10, 10);
my $count_freq_for = 'F';
#srand(10);

GetOptions(
    'num-regions|n=i' => \$num_regions, 
    'region-len|l=i'  => \$region_len, 
    'char|c=s'        => \$count_freq_for,
) or usage();

my $file = shift || usage();

# List of (up to) $num_regions random numbers, spanning the file size
# However, we skip all '>sp' lines so take more numbers (estimate)
open my $fh, '<', $file  or die "Can't open $file: $!";
$num_regions += int $num_regions * fraction_skipped($fh);
my @rand = uniq sort { $a <=> $b } 
    map { int(rand (-s $file)-$region_len) } 1..$num_regions;
say "Starting positions for regions: @rand";

my ($nchars_prev, $nchars, $chars_left) = (0, 0, 0); 

my $region;

while (my $line = <$fh>) { 
    chomp $line;
    # Total number of characters so far, up to this line and with this line
    $nchars_prev = $nchars;
    $nchars += length $line;
    next if $line =~ /^\s*>sp/;

    # Complete the region if there wasn't enough chars on the previous line 
    if ($chars_left > 0) {
        $region .= substr $line, 0, $chars_left;
        my $cnt = () = $region =~ /$count_freq_for/g;
        say "$region $cnt";
        $chars_left = -1; 
    };  

    # Random positions that happen to be on this line    
    my @pos = grep { $_ > $nchars_prev and $_ < $nchars } @rand;
    # say "\tPositions on ($nchars_prev -- $nchars) line: @pos" if @pos;

    for (@pos) { 
        my $pos_in_line = $_ - $nchars_prev;
        $region = substr $line, $pos_in_line, $region_len; 

        # Don't print if there aren't enough chars left on this line
        last if ( $chars_left = 
            ($region_len - (length($line) - $pos_in_line)) ) > 0;

        my $cnt = () = $region =~ /$count_freq_for/g;
        say "$region $cnt";
    }   
}


sub fraction_skipped {
    my ($fh) = @_;
    my ($skip_len, $data_len);
    my $curr_pos = tell $fh;
    seek $fh, 0, 0  if $curr_pos != 0;
    while (<$fh>) {
        chomp;
        if (/^\s*>sp/) { $skip_len += length }
        else           { $data_len += length }
    }
    seek $fh, $curr_pos, 0;  # leave it as we found it
    return $skip_len / ($skip_len+$data_len);
}

sub usage {
    say STDERR "Usage: $0 [options] file", "\n\toptions: ...";
    exit;
}

Uncomment the srand line so as to always have the same run, for testing. Notes follow.

Some corner cases

  • If the 10-long window doesn't fit on the line from its random position, it is completed on the next line -- but any (possible) further random positions on this line are left out. So if our random list has 1120 and 1122 while a line ends at 1125, then the window starting at 1122 is skipped. Unlikely, possible, and of no consequence (other than leaving us one region short).

  • When an incomplete region is filled up on the next line (the first if in the while loop), it is possible that that line is shorter than the remaining needed characters ($chars_left). This is very unlikely and needs an additional check there, which is left out of the main code (a minimal sketch of such a check follows this list).

  • Random numbers are pruned of duplicates. This skews the sequence, but only minutely, which should not matter here; and we may end up with fewer numbers than asked for, but only by very little.
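
For completeness, a minimal sketch of such a check, as a drop-in replacement for the completion block (the first if) in the while loop. This is only a sketch, not part of the code above; it keeps filling the region across as many lines as needed instead of assuming one extra line suffices.

if ($chars_left > 0) {
    my $take = substr $line, 0, $chars_left;
    $region     .= $take;
    $chars_left -= length $take;   # still > 0 if this whole line was too short
    next if $chars_left > 0;       # keep filling the region from following lines
    my $cnt = () = $region =~ /$count_freq_for/g;
    say "$region $cnt";
    $chars_left = -1;
}

As with the first corner case, any random positions on a line consumed entirely this way are then left out.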

Handling of issues regarding randomness

"Randomness" here is pretty basic, what seems suitable. We also need to consider the following.

Random numbers are drawn over the interval spanning the file size, int(rand -s $file) (minus the region size). But lines starting with >sp are skipped, and any of our numbers that fall within those lines won't be used, so we may end up with fewer regions than numbers drawn. Those lines are shorter, thus less likely to have numbers fall on them, so not many numbers are lost; but in some runs I saw as many as 3 out of 10 numbers skipped, ending up with a random sample 70% of the desired size.

If this is a bother, there are ways to approach it. To avoid skewing the distribution even further, they should all involve pre-processing the file.

The code above makes an initial run over the file to compute the fraction of characters that will be skipped. That is then used to increase the number of random points drawn. This is of course an "average" measure, but it should still produce a number of regions close to the desired one for large enough files.

More detailed measures would need to see which random points of a (much larger) distribution are going to be lost to skipped lines, and then re-sample to account for that. This may still mess with the distribution, which arguably isn't an issue here; more to the point, it may simply be unneeded.

In all this you read the big file twice. The extra processing time should only be seconds, but if that is unacceptable, change the function fraction_skipped to read through only 10-20% of the file; with large files this should still provide a reasonable estimate.
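
A minimal sketch of that change, stopping the estimation pass after roughly the first 10% of the file's bytes (only a sketch; adjust the fraction as needed):

sub fraction_skipped {
    my ($fh) = @_;
    my ($skip_len, $data_len) = (0, 0);
    my $sample_bytes = int( (-s $fh) * 0.10 );   # scan only ~10% of the file
    my $curr_pos = tell $fh;
    seek $fh, 0, 0  if $curr_pos != 0;
    while (<$fh>) {
        chomp;
        if (/^\s*>sp/) { $skip_len += length }
        else           { $data_len += length }
        last if $skip_len + $data_len >= $sample_bytes;
    }
    seek $fh, $curr_pos, 0;  # leave it as we found it
    return $skip_len / ($skip_len + $data_len);
}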

Note on a particular test case

With srand(10) (the commented-out line near the beginning) we get random numbers such that on one line a region starts 8 characters before the end of the line! So that case does exercise the code that completes a region on the next line.


Here is a simple driver to run the above a given number of times, for statistics.

Doing it with builtin tools (system, qx) is altogether harder, and libraries (modules) help. I use IPC::Run here; there are quite a few other options (see the note at the end of this answer).

Adjust and add code to process as needed for statistics; output is in files.

use warnings;
use strict;
use feature 'say';

use Getopt::Long;
use IPC::Run qw(run);

my $outdir = 'rr_output';         # pick a directory name
mkdir $outdir if not -d $outdir;    
my $prog  = 'random_regions.pl';  # your name for the program
my $input = 'data_file.txt';      # your name for input file     
my $ch = 'F';

my ($runs, $regions, $len) = (10, 10, 10);    
GetOptions(
    'runs|n=i'  => \$runs, 
    'regions=i' => \$regions, 
    'length=i'  => \$len, 
    'char=s'    => \$ch, 
    'input=s'   => \$input
) or usage();

my @cmd = ( $prog, $input, 
    '--num-regions', $regions, 
    '--region-len', $len, 
    '--char', $ch
);    
say "Run: @cmd, $runs times.";

for my $n (1..$runs) {
    my $outfile = "$outdir/regions_r$n.txt";
    say "Run #$n, output in: $outfile";
    run \@cmd, '>', $outfile  or die "Error with @cmd: $!";
}    

sub usage {
    say STDERR "Usage: $0 [options]", "\n\toptions: ...";
    exit;
}

Please expand on the error checking. See for instance this post and its links for details.
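
For example, a couple of basic checks that could go right after GetOptions and before the loop (a sketch only, not part of the driver above):

# Sketch only: fail early with clear messages instead of relying on $! later
# (assumes $prog is a path relative to the current directory)
die "Program '$prog' not found or not executable\n"  if not -x $prog;
die "Input file '$input' not readable\n"             if not -r $input;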

Simplest use: driver_random.pl -n 4, but you can give all of the main program's parameters.

The called program (random_regions.pl above) must be executable.


  Some, from simple to more capable: IPC::System::Simple, Capture::Tiny, IPC::Run3. (Then comes IPC::Run, used here.) Also see String::ShellQuote, to prepare commands without quoting issues, shell-injection bugs, and other problems. See the links (examples) assembled in this post.
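
For illustration, a minimal sketch of the driver's run-and-redirect step done with Capture::Tiny instead of IPC::Run (@cmd and $outfile as in the driver above):

use Capture::Tiny qw(capture);

# Sketch only: capture the child's output ourselves and write it to the file
my ($stdout, $stderr, $exit) = capture { system @cmd };
die "Error running @cmd: $stderr" if $exit != 0;
open my $out_fh, '>', $outfile or die "Can't open $outfile: $!";
print $out_fh $stdout;
close $out_fh;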

zdim

awk to the rescue!

You didn't specify, but there are two random actions going on; I treated them independently, which may not be what you want: first picking a line, and second picking a random 10-letter substring from that line.

This assumes the file (or actually half of it) can fit in memory. Otherwise, split the file into equal chunks and run this on the chunks. Doing so will reduce some of the clustering, but I'm not sure how important that is in this case. (If you have one big file, it's possible that all samples are drawn from the first half; with splitting you eliminate that possibility.) For certain cases this is a desired property; I don't know your case.

$ awk 'BEGIN {srand()} 
       !/^>/ {a[++n]=$0} 
       END   {while(i++<1000) 
                {line=a[int(rand()*n)+1]; 
                 s=int(rand()*(length(line)-9))+1;
                 print ss=substr(line,s,10), gsub(/F/,"",ss)}}' file

GERISPKDRC 0
QDEPRCLHEW 0
LLYQLFRNLF 2
GTHGAGAMES 0
TKALQDVQIR 0
FCVHTKALQD 1
SNKAQVKPGQ 0
CMECQGHGER 0
TRRFVGHTKD 1
...
karakfa
  • You said you picked each line once and all together, but I see only one solution; do you mind posting the two of them? i.e. when you randomly pick one section from each line? – Learner Feb 12 '19 at 23:18
  • This is it. Each line can be picked multiple times. – karakfa Feb 12 '19 at 23:45
  • Since it's selecting a random 10 char string from a random line on each iteration, that could print the same 10-char string from the same line more than once but the OP said `Note, we should not pick up the same region twice`. – Ed Morton Feb 13 '19 at 19:21
  • Yes, I know. It was modified after my answer. For a true random sample it should be able to have duplicates. It's still not clear whether each line can be sampled more than once. Overlapping sequences have their place in sampling as well; also not clear whether that's allowed or not. I gave up. – karakfa Feb 13 '19 at 19:28
  • "_For a true random sample it should be able have duplicates_" -- right, but then most good generators (that I know of) don't. If any decent randomness is needed this is a very complex problem really -- but that probably just isn't the case – zdim Feb 13 '19 at 19:48
  • More fundamental than that. I guess here it's intended to have a subrandom (or low-discrepancy) sequence to cover the space more or less evenly, avoiding the clustering (or repeats) that comes from random sequences. – karakfa Feb 13 '19 at 19:59

Here is one solution using Perl

It slurps the entire file into memory. Then the lines starting with > are removed. Here I'm looping 10 times ($i<10); you can increase the count. Then the rand function is called with the length of the remaining text, and using that random value a substring of length 10 is taken. The $s!~/\n/ guard makes sure we don't choose a substring that crosses newlines.

$ perl -0777 -ne '$_=~s/^>.+?\n//smg; while($i<10) { $x=rand(length($_)); $s=substr($_,$x,10); $f=()=$s=~/F/g; if($s!~/\n/) { print "$s $f\n" ;$i++} else { $i-- } } ' random10.txt
ENTQLLETKN 0
LSEGALSPDG 0
LRKARAEAED 0
RLWDLTTGTT 0
KWSGRCGLGY 0
TRRFVGHTKD 1
PVKRPIPHPA 0
GMVQQIQSVC 0
LTHPVLSFGI 1
KVNFPENGFL 2

$

To see the random numbers that were generated:

$ perl -0777 -ne '$_=~s/^>.+?\n//smg; while($i<10) { $x=rand(length($_)); $s=substr($_,$x,10); $f=()=$s=~/F/g; if($s!~/\n/) { print "$s $f $x\n" ;$i++} else { $i-- } } ' random10.txt
QLDGSLTMSS 0 1378.61409368207
DLIAKVDELT 0 1703.46689004765
SGGGANGTSF 1 900.269562152326
PEELTLSPKL 0 1368.55540468164
TCLSEGALSP 0 1016.50744004085
NRTWNSSAVP 0 23.7868578293154
VNFPENGFLS 2 363.527933104776
NSGLTWSGND 0 48.656607650744
MILSASRDKT 0 422.67705815168
RRGEDLFMCM 1 290.828530365
AGDGLLTPDA 0 1481.78080339531

$
stack0114106

Since your input file is huge I'd do it in these steps:

  1. select random 10-char strings from each line of your input file
  2. shuffle those to get the number of samples you want in random order
  3. count the Fs

e.g.

$ cat tst.sh
#!/bin/env bash
infile="$1"

sampleSize=10
numSamples=15

awk -v sampleSize="$sampleSize" '
    BEGIN { srand() }
    !/^>/ {
        begPos = int((rand() * sampleSize) + 1)
        endPos = length($0) - sampleSize
        for (i=begPos; i<=endPos; i+=sampleSize) {
            print substr($0,i,sampleSize)
        }
    }
' "$infile" |
shuf -n "$numSamples"


$ ./tst.sh file
HGDIKCVLNE
QDEPRCLHEW
SEVQAIIEST
THDLRVSLEE
SEWVSCVRFS
LTRYLTLNAS
KDGQKITFHG
SNSPEPQKAV
QGGSKATTPA
QLLETKNALN
LLFCDNHKKQ
DETNYGIPQR
IRFQPQLNPD
LQTIRFSPDI
SLKRCGGFLI

$ ./tst.sh file | awk '{print $0, gsub(/F/,"")}'
SPKLQLDGSL 0
IKLFCVHTKA 1
VVSRCRLRHT 0
SPEPQKAVEQ 0
AYNPKNFSND 1
FGESRPELGS 1
AGDGLLTPDA 0
VGHTKDVLSV 0
VTHDLRVSLE 0
PISLGIFPLP 1
ASQITNLRSF 1
LTRPPEELTL 0
FDRYGEEGLK 1
IYIEGQDEPR 0
WNTLGVCKYT 0

Just change numSamples from 15 to 1000 or whatever you like when running against your real data.

The above relies on shuf -n being able to handle however much input we throw at it, presumably much like sort does, by using paging. If it fails in that regard then obviously you'd have to choose/implement a different tool for that part. FWIW I tried seq 100000000 | shuf -n 10000 (i.e. 10 times as many input lines as the OP's posted max file length of 10000000, to account for the awk part generating N lines of output per line of input, and 10 times as many output lines as the OP's posted 1000) and it worked fine, taking only a few seconds to complete.

Ed Morton