Count subsequences in hundreds of GB of data

Question

I'm trying to process a very large file and tally the frequency of all sequences of a certain length in the file.

To illustrate what I'm doing, consider a small input file containing the sequence abcdefabcgbacbdebdbbcaebfebfebfeb

Below, the code reads the whole file in, and takes the first substring of length n (below I set this to 5, although I want to be able to change this) and counts its frequency:

abcde => 1

Next line, it moves one character to the right and does the same:

bcdef => 1

It then continues for the rest of the string and prints the 5 most frequent sequences:

open my $in, '<', 'in.txt' or die $!; # 'abcdefabcgbacbdebdbbcaebfebfebfeb'

my $seq = <$in>; # read whole file into string
my $len = length($seq);

my $seq_length = 5; # set k-mer length
my %data;

for (my $i = 0; $i <= $len - $seq_length; $i++) {
     my $kmer = substr($seq, $i, $seq_length);
     $data{$kmer}++;
}

# print the hash, showing only the 5 most frequent k-mers
my $count = 0;
foreach my $kmer (sort { $data{$b} <=> $data{$a} } keys %data ){
    print "$kmer $data{$kmer}\n";
    $count++;
    last if $count >= 5;
}

ebfeb 3
febfe 2
bfebf 2
bcaeb 1
abcgb 1

However, I would like to find a more efficient way of achieving this. If the input file was 10GB or 1000GB, then reading the whole thing into a string would be very memory expensive.

I thought about reading in blocks of characters, say 100 at a time and proceeding as above, but here, sequences that span 2 blocks would not be tallied correctly.

My idea then, is to only read in n number of characters from the string, and then move onto the next n number of characters and do the same, tallying their frequency in a hash as above.

Are there any suggestions about how I could do this? I've had a look a read using an offset, but can't get my head around how I could incorporate this here
Is substr the most memory efficient tool for this task?

The main thing to remember is that after you've read the first block, you need to keep the last k-1 characters from the previous block available for use in building k-mers with the first few characters of the next block. There isn't going to be anything notably different in memory efficiency than `substr`. — Jonathan Leffler, Mar 24 '16 at 14:01
I think this question would benefit from a bit more explanation - please bear in mind that not all of us know what you're talking about when you refer to "k-mers" so can't really figure out what you're trying to accomplish. — Sobrique, Mar 24 '16 at 14:01
Re "*this seems to be a very memory expensive way of doing things*", What do you mean, "seems"? Are you having memory issues or not? — ikegami, Mar 24 '16 at 14:02
use ``read FILEHANDLE,SCALAR,LENGTH`` to get 4 meg (or something moderately small) chunks, do your existing for loop scan, save the last seq_length -1 bytes and append them onto the next chunk. Repeat to EOF and then output your most frequent — Vorsprung, Mar 24 '16 at 14:15
@ikegami - I'm trying to find a solution that would be as efficient with a 1GB file as it would be with a 100GB file. My solution would perform very slowly on larger files — fugu, Mar 24 '16 at 15:18
I'll post something when I can. // Does the file consist of a small set of different characters? — ikegami, Mar 24 '16 at 15:22

Borodin · Accepted Answer · 2016-07-04T09:45:51.370

5

From your own code it's looking like your data file has just a single line of data -- not broken up by newline characters -- so I've assumed that in my solution below. Even if it's possible that the line has one newline character at the end, the selection of the five most frequent subsequences at the end will throw this out as it happens only once

This program uses sysread to fetch an arbitrarily-sized chunk of data from the file and append it to the data we already have in memory

The body of the loop is mostly similar to your own code, but I have used the list version of for instead of the C-style one as it is much clearer

After processing each chunk, the in-memory data is truncated to the last SEQ_LENGTH-1 bytes before the next cycle of the loop pulls in more data from the file

I've also use constants for the K-mer size and the chunk size. They are constant after all!

The output data was produced with CHUNK_SIZE set to 7 so that there would be many instances of cross-boundary subsequences. It matches your own required output except for the last two entries with a count of 1. That is because of the inherent random order of Perl's hash keys, and if you require a specific order of sequences with equal counts then you must specify it so that I can change the sort

use strict;
use warnings 'all';

use constant SEQ_LENGTH => 5;           # K-mer length
use constant CHUNK_SIZE => 1024 * 1024; # Chunk size - say 1MB

my $in_file = shift // 'in.txt';

open my $in_fh, '<', $in_file or die qq{Unable to open "$in_file" for input: $!};

my %data;
my $chunk;
my $length = 0;

while ( my $size = sysread $in_fh, $chunk, CHUNK_SIZE, $length ) {

    $length += $size;

    for my $offset ( 0 .. $length - SEQ_LENGTH ) {
         my $kmer = substr $chunk, $offset, SEQ_LENGTH;
         ++$data{$kmer};
    }

    $chunk = substr $chunk, -(SEQ_LENGTH-1);
    $length = length $chunk;
}

my @kmers = sort { $data{$b} <=> $data{$a} } keys %data;
print "$_ $data{$_}\n" for @kmers[0..4];

output

ebfeb 3
febfe 2
bfebf 2
gbacb 1
acbde 1

Note the line: $chunk = substr $chunk, -(SEQ_LENGTH-1); which sets $chunk as we pass through the while loop. This ensures that strings spanning 2 chunks get counted correctly.

The $chunk = substr $chunk, -4 statement removes all but the last four characters from the current chunk so that the next read appends CHUNK_SIZE bytes from the file to those remaining characters. This way the search will continue, but starts with the last 4 of the previous chunk's characters in addition to the next chunk: data doesn't fall into a "crack" between the chunks.

edited Jul 04 '16 at 09:45

answered Mar 24 '16 at 15:53

Borodin

126,100
9
70
144

1

Excellent - thanks. So does the `$chunk = substr $chunk, -(SEQ_LENGTH-1);` ensure that k-mers spanning 2 chunks get counted correctly? – fugu Mar 24 '16 at 17:32
2

@fugu: Pretty much, yes. `SEQ_LENGTH` is 5, so `SEQ_LENGTH-1` is 4, and `$chunk = substr $chunk, -4` will remove all but the last four characters from the current chunk, and the next `read` will append the next `CHUNK_SIZE` bytes from the file to it. The search will then continue starting with the last 4 of the old characters plus the first of the new, then the last three of the old characters plus the first two of the new, etc. – Borodin Mar 24 '16 at 17:43
Is this `while(read ...){ for } ` approach the standard efficient "idiom" for reading input a "sip at a time" in perl? @fugu clarified whether the application was supposed to read a `$chunk` at a time sequentially or slide a window of length 5 along the sequence looking for strings (as @Jonathan Lethler. noted) and I thought of lazy lists. But a really long string would first have to be made into a list. – G. Cito Mar 24 '16 at 17:43
@G.Cito: It's the way I've chosen to write it. There is no *"the standard efficient "idiom" for this"* and, true to Perl's roots, there are many ways to do the same thing. My algorithm creates a window of size `CHUNK_SIZE + SEQ_LENGTH - 1` characters and generally steps it on by `CHUNK_SIZE` characters at a time. I don't know what you mean by *"a really long string would first have to be made into a list"*. It can't be true as my code doesn't create a list – Borodin Mar 24 '16 at 17:47
@Borodin - I was seeing the input data as one very long string (as you were?). Rather than use the sample provided I created a (still small) data file with 20k characters in one big long string. To lazy list shift a five character "window" along the string I then read it in and made it into a list. I will add a response that illustrates. – G. Cito Mar 24 '16 at 18:19
@Borodin - Just to check: As file size increases, my version will become slower and slower for 2 reasons: 1) reading a file into a string becomes expensive with increasing file size and 2) the expense of building a hash with long k-mers increases with k-mer size. Your version will suffer the 2nd problem, but not the first. Is that the case? – fugu Mar 25 '16 at 10:42
@Borodin - I'm confused, because I've just run our programs side by side on the same data (1GB) for 10-mers and they both finished in almost exactly the same time. Should I only expect to see efficiency increases at much greater file sizes? – fugu Mar 25 '16 at 11:17
1

@fugu: The process is probably disk-bound, which means it's primarily limited by the speed you can read from a disk drive. The advantage my solution has over yours is that reading a 100GB file into memory in one shot is impossible, which is why I have focused on reading in 1MB chunks. How long did the programs take for the 1GB data? Please describe the *"incorrect results"* – Borodin Mar 25 '16 at 13:23
@Borodin - On my 8GB MBP both took around 25 mins. – fugu Mar 25 '16 at 14:09
3

@fugu I think you're confusing the terms "efficiency", "speed", and "memory usage". Borodin's solution is *efficient* because it reads each chunk of the file only once (just like your solution). Which solution is *faster* depends on whether `readline` is faster than a series of `read` calls, among other things. The difference in speed may not be huge because I/O will be the bottleneck in both cases. Borodin's solution uses much less *memory*, so it will work for files of any length; your solution won't even run for large files, which has nothing to do with efficiency. – ThisSuitIsBlackNot Mar 25 '16 at 14:45
@ThisSuitIsBlackNot - Appreciate the clarification. I'm not from a computer science background, and I've never worked on a program where I've actually thought in terms of memory usage etc. – fugu Mar 25 '16 at 14:48
1

@fugu You're welcome. You usually don't have to think about performance and resource utilization too much, but when you do, it's important to be clear with your terminology. "Efficient" is kind of a vague term, so it's better to specify your memory and time constraints. By the way, switching `read` to [`sysread`](http://perldoc.perl.org/functions/sysread.html) might make Borodin's solution run a little faster. – ThisSuitIsBlackNot Mar 25 '16 at 15:06
@Borodin I think you should add the explanation of `SEQ_LENGTH` so `SEQ_LENGTH-1` in the second comment to the body of the question. – G. Cito Mar 29 '16 at 13:49

ikegami · Answer 2 · 2016-03-28T16:28:33.887

Even if you don't read the entire file into memory before processing it, you could still run out of memory.

A 10 GiB file contains almost 11E9 sequences.

If your sequences are sequences of 5 characters chosen from a set of 5 characters, there are only 5⁵ = 3,125 unique sequences, and this would easily fit in memory.

If your sequences are sequences of 20 characters chosen from a set of 5 characters, there are 5²⁰ = 95E12 unique sequences, so the all 11E9 sequences of a 10 GiB file could unique. That does not fit in memory.

In that case, I suggest doing the following:

Create a file that contains all the sequences of the original file.

The following reads the file in chunks rather than all at once. The tricky part is handling sequences that span two blocks. The following program uses sysread^[1] to fetch an arbitrarily-sized chunk of data from the file and append it to the last few character of the previously read block. This last detail allows sequences that span blocks to be counted.

perl -e'
   use strict;
   use warnings qw( all );

   use constant SEQ_LENGTH => 20;
   use constant CHUNK_SIZE => 1024 * 1024;

   my $buf = "";
   while (1) {
      my $size = sysread(\*STDIN, $buf, CHUNK_SIZE, length($buf));
      die($!) if !defined($size);
      last if !$size;

      for my $offset ( 0 .. length($buf) - SEQ_LENGTH ) {
         print(substr($buf, $offset, SEQ_LENGTH), "\n");
      }

      substr($buf, 0, -(SEQ_LENGTH-1), "");
   }
' <in.txt >sequences.txt

Sort the sequences.

sort sequences.txt >sorted_sequences.txt

Count the number of instances of each sequeunces, and store the count along with the sequences in another file.

perl -e'
   use strict;
   use warnings qw( all );

   my $last = "";           
   my $count;
   while (<>) {
      chomp;
      if ($_ eq $last) {
         ++$count;
      } else {
         print("$count $last\n") if $count;
         $last = $_;
         $count = 1;
      }
   }
' sorted_sequences.txt >counted_sequences.txt

Sort the sequences by count.

sort -rns counted_sequences.txt >sorted_counted_sequences.txt

Extract the results.

perl -e'
   use strict;
   use warnings qw( all );

   my $last_count;
   while (<>) {
      my ($count, $seq) = split;
      last if $. > 5 && $count != $last_count;
      print("$seq $count\n");
      $last_count = $count;
   }
' sorted_counted_sequences.txt

This also prints ties for 5th place.

This can be optimized by tweaking the parameters passed to sort^[2], but it should offer decent performance.

sysread is faster than previously suggested read since the latter performs a series of 4 KiB or 8 KiB reads (depending on your version of Perl) internally.
Given the fixed-length nature of the sequence, you could also compress the sequences into ceil(log₂₅₆(5²⁰)) = 6 bytes then base64-encode them into ceil(6 * 4/3) = 8 bytes. That means 12 fewer bytes would be needed per sequence, greatly reducing the amount to read and to write.

_{Portions of this answer was adapted from content by user:622310 licensed under cc by-sa 3.0.}

I ran a lot of tests yesterday, and realised that increasing the k-mer length (`SEQ_LENGTH`) drastically reduces the performance. On my machine, with a 1GB test file, and a k-mer length set to 20 I run out of memory. I realised that this is because, as you say, if every sequence is unique then it builds a very large hash. — fugu, Mar 26 '16 at 09:36
Note - In my question, I arbitrarily set the k-mer length to 5. I should have corrected @Borodin when he assumed it was a constant in my program - it's not. — fugu, Mar 26 '16 at 09:38
I will ask another question tomorrow, but assume that input files will be smaller than 10GB. Thanks for your suggestions so far! — fugu, Mar 27 '16 at 09:50
A 10 GiB `in.txt` produces a 210 GiB (without the compression I mentioned) or 90 GiB (with the compression I mentioned) `sequences.txt`. Hardly problematic. — ikegami, Mar 28 '16 at 16:24
@borodin and ikegami ... is `sysread` just always better than `read` for raw fetching of data? `perlfunc` gives advice that is not entirely clear on when to choose which. For example with the `Stream::Reader` script from my answer I replaced `read` with `sysread` in `Reader.pm` (line 271 I think) and gained 9-10% But I am not clear on what the tradeoffs might be. — G. Cito, Mar 30 '16 at 16:30
@G. Cito, That's not really related, and it would make a good question. Got an answer for you if you ask it. — ikegami, Mar 30 '16 at 16:49

G. Cito · Answer 3 · 2016-04-02T13:30:32.833

The most straightforward approach is to use the substr() function:

% time perl -e '$/ = \1048576; 
           while ($s = <>) { for $i (0..length $s) { 
             $hash{ substr($s, $i, 5) }++ } }  
           foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
             print "$k $hash{$k}\n"; $it++; last if $it == 5;}' nucleotide.data  
NNCTA 337530
GNGGA 337362
NCACT 337304
GANGN 337290
ACGGC 337210
      269.79 real       268.92 user         0.66 sys

The Perl Monks node on iterating along a string was a useful resource, as were the responses and comments from @Jonathan Leffler, @ÆvarArnfjörðBjarmason, @Vorsprung, @ThisSuitIsBlackNotm @borodin and @ikegami here in this SO posting. As was pointed out, the issue with very large files is memory, which in turn requires that files be read in chunks. When reading from a file in chunks, if your code is iterating through the data it has to properly handle switching from one chunk/source to the next without dropping any bytes.

As a simplistic example, next unless length $kmer == 5; will get checked during each 1048576 byte/character iteration in the script above, meaning strings that exist at the end of one chunk and the beginning of another will be missed (cf. @ikegami's and @Borodin's solutions). This will alter the resulting count, though perhaps not in a statistically significant way^[1]. Both @borodin and @ikegami address the issue of missing/overlapping strings between chunks by appending each chunk to the remaining characters of the previous chunk as they sysread in their while() loops. See Borodin's response and comments for an explanation of how it works.

Using Stream::Reader

Since perl has been around for quite a while and has collected a lot of useful code, another perfectly valid approach is to look for a CPAN module that achieves the same end. Stream::Reader can create a "stream" interface to a file handle that wraps the solution to the chunking issue behind a set of convenient functions for accessing the data.

use Stream::Reader; 
use strict;
use warnings;

open( my $handler, "<", shift ); 
my $stream = Stream::Reader->new( $handler, { Mode => "UB" } ); 

my %hash;
my $string;
while ($stream->readto("\n", { Out => \$string }) ) { 
    foreach my $i (0..length $string) { 
       $hash{ substr($string, $i, 5) }++ 
    } 
} 

my $it;
foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash ) { 
       print "$k $hash{$k}\n"; 
       $it++; last if $it == 5;
}

On a test data file nucleotide.data, both Borodin's script and the Stream::Reader approach shown above produced the same top five results. Note the small difference compared to the results from the shell command above. This illustrates the need to properly handle reading data in chunks.

NNCTA 337530
GNGGA 337362
NCACT 337305
GANGN 337290
ACGGC 337210

The Stream::Reader based script was significantly faster:

time perl sequence_search_stream-reader.pl nucleotide.data   
252.12s
time perl sequence_search_borodin.pl nucleotide.data     
350.57s

The file nucleotide.data was a 1Gb in size, consisting of single string of approximately 1 billion characters:

% wc nucleotide.data
       0       0 1048576000 nucleotide.data
% echo `head -c 20 nucleotide.data`
NCCANGCTNGGNCGNNANNA

I used this command to create the file:

perl -MString::Random=random_regex -e '
 open (my $fh, ">>", "nucleotide.data");
 for (0..999) { print $fh random_regex(q|[GCNTA]{1048576}|) ;}'

Lists and Strings

Since the application is supposed to read a chunk at a time and move this $seq_length sized window along the length of the data building a hash for tracking string frequency, I thought a "lazy list" approach might work here. But, to move a window through a collection of data (or slide as with List::Gen) reading elements natatime, one needs a list.

I was seeing the data as one very long string which would first have to be made into a list for this approach to work. I'm not sure how efficient this can be made. Nevertheless, here is my attempt at a "lazy list" approach to the question:

use List::Gen 'slide';

$/ = \1048575; # Read a million character/bytes at a time.
my %hash;

while (my $seq = <>) {
  chomp $seq;
  foreach my $kmer (slide { join("", @_) } 5 => split //, $seq) {
    next unless length $kmer == 5;
    $hash{$kmer}++;
  }
}

foreach my $k (sort { $hash{$b} <=> $hash{$a} } keys %hash) {
  print "$k $hash{$k}\n";
  $it++; last if $it == 5;
}

I'm not sure this is "typical perl" (TIMTOWDI of course) and I suppose there are other techniques (cf. gather/take) and utilities suitable for this task. I like the response from @Borodin best since it seems to be the most common way to take on this task and is more efficient for the potentially large file sizes that were mentioned (100Gb).

Is there a fast/best way to turn a string into a list or object? Using an incremental read() or sysread() with substr wins on this point, but even with sysread a 1000Gb string would require a lot of memory just for the resulting hash. Perhaps a technique that serialized/cached the hash to disk as it grew beyond a certain size would work with very, very large strings that were liable to create very large hashes.

Postscript and Results

The List::Gen approach was consistently between 5 and 6 times slower than @Borodin's approach. The fastest script used the Stream::Reader module. Results were consistent and each script selected the same top five strings with the two smaller files:

1 million character nucleotide string

sequence_search_stream-reader.pl :     0.26s
sequence_search_borodin.pl       :     0.39s
sequence_search_listgen.pl       :     2.04s

83 million character nucleotide string

With the data in file xaa:

wc xaa
       0       1 83886080 xaa

% time perl sequence_search_stream-reader.pl xaa
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
       21.33 real        20.95 user         0.35 sys

% time perl sequence_search_borodin.pl xaa     
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
       28.13 real        28.08 user         0.03 sys

% time perl sequence_search_listgen.pl xaa 
GGCNG 31510
TAGNN 31182
AACTA 30944
GTCAN 30792
ANTAT 30756
      157.54 real       156.93 user         0.45 sys

1 billion character nucleotide string

In a larger file the differences were of similar magnitude but, because as written it does not correctly handle sequences spanning chunk boundaries, the List::Gen script had the same discrepancy as the shell command line at the beginning of this post. The larger file meant a number of chunk boundaries and a discrepancy in the count.

sequence_search_stream-reader.pl :   252.12s
sequence_search_borodin.pl       :   350.57s
sequence_search_listgen.pl       :  1928.34s

The chunk boundary issue can of course be resolved, but I'd be interested to know about other potential errors or bottlenecks that are introduced using a "lazy list" approach. If there were any benefit in terms of CPU usage from using slide to "lazily" move along the string, it seems to be rendered moot by the need to make a list out of the string before starting.

I'm not surprised that reading data across chunk boundaries is left as an implementation exercise (perhaps it cannot be handled "magically") but I wonder what other CPAN modules or well worn subroutine style solutions might exist.

1. Skipping four characters - and thus four 5 character string combinations - at the end of each megabyte read of a terabyte file would mean the results would not include 3/10000 of 1% from the final count.

echo "scale=10; 100 *  (1024^4/1024^2 ) * 4 / 1024^4 " | bc
.0003814697

The OP talks about file sizes of between 10GB and 1000GB. You are reading all of that into memory at once, copying it to a list of individual characters, and then creating another list of five-character substrings. That accounts for between 70GB and 7000GB of memory in use all at once, without allowing for Perl's overhead on each data item. I would be very surprised if the OP has a system with capabilities like that — Borodin, Mar 25 '16 at 00:06
@Borodin Yup just tried it on a 32Gb system with 4 cores and an 800MB (NB **MB**) file. My script gets killed, yours never finishes ... trying to figure that out. On a million character string of nucleotides your script takes 0.33 seconds .. min 2.0 seconds.. — G. Cito, Mar 25 '16 at 00:17
How long did you wait to decide that my code *"never finishes"*? I would expect it to take around five or ten minutes to process 800MB — Borodin, Mar 25 '16 at 00:39
@Borodin I waited over 10 minutes and killed the process because my CPU was overheating! So this morning I killed a number of other processes switched `read` to `sysread` and tried again and it finished in 6 minutes. Cheers! — G. Cito, Mar 25 '16 at 15:18
@Borodin I definitely want to learn more about how to how to handle large (1Tb) files more effectively, conserving memory and other resources. For now, setting `$/ = \1048575;` allowed my approach to work with large files but it was very slow :) — G. Cito, Mar 25 '16 at 20:40
That won't make it work. It will just read the first 1MB-1 (why not use a whole megabyte -- 1,048,576 bytes?) of the file and process that, ignoring the rest of the data. And if your CPU is overheating then you have a cooling problem; either the heatsink isn't properly thermally connected to the CPU of you don't have any fans. Perhaps the fan grilles are clogged with dust. You should be able to run you CPU at full speed continuously. — Borodin, Mar 25 '16 at 21:23
Really? I used `split` to break up the 850Mb file into ~ 80Mb pieces and your script and mine both produce the same results ... except for speed :-\ See changes above. I will remove the answer if it s not helpful. — G. Cito, Mar 25 '16 at 22:26
Sorry, I was reading your first code fragment which has only a single read in `chomp ( $seq = <> ) ` — Borodin, Mar 25 '16 at 23:02

Ævar Arnfjörð Bjarmason · Answer 4 · 2016-03-25T23:28:44.307

2

Generally speaking Perl is really slow at character-by-character processing solutions like those posted above, it's much faster at something like regular expressions since essentially your overhead is mainly how many operators you're executing.

So if you can turn this into a regex-based solution that's much better.

Here's an attempt to do that:

$ perl -wE 'my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb"; for my $pos (0..4) { $str =~ s/^.// if $pos; say for $str =~ m/(.{5})/g }'|sort|uniq -c|sort -nr|head -n 5
  3 ebfeb
  2 febfe
  2 bfebf
  1 gbacb
  1 fabcg

I.e. we have our string in $str, and then we pass over it 5 times generating sequences of 5 characters, after the first pass we start chopping off a character from the front of the string. In a lot of languages this would be really slow since you'd have to re-allocate the entire string, but perl cheats for this special case and just sets the index of the string to 1+ what it was before.

I haven't benchmarked this but I bet something like this is a much more viable way to do this than the algorithms above, you could also do the uniq counting in perl of course by incrementing a hash (with the /e regex option is probably the fastest way), but I'm just offloading that to |sort|uniq -c in this implementation, which is probably faster.

A slightly altered implementation that does this all in perl:

$ perl -wE 'my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb"; my %occur; for my $pos (0..4) { substr($str, 0, 1) = "" if $pos; $occur{$_}++ for $str =~ m/(.{5})/gs }; for my $k (sort { $occur{$b} <=> $occur{$a} } keys %occur) { say "$occur{$k} $k" }'
3 ebfeb
2 bfebf
2 febfe
1 caebf
1 cgbac
1 bdbbc
1 acbde
1 efabc
1 aebfe
1 ebdbb
1 fabcg
1 bacbd
1 bcdef
1 cbdeb
1 defab
1 debdb
1 gbacb
1 bdebd
1 cdefa
1 bbcae
1 bcgba
1 bcaeb
1 abcgb
1 abcde
1 dbbca

Pretty formatting for the code behind that:

my $str = "abcdefabcgbacbdebdbbcaebfebfebfeb";
my %occur;
for my $pos (0..4) {
    substr($str, 0, 1) = "" if $pos;
    $occur{$_}++ for $str =~ m/(.{5})/gs;
}

for my $k (sort { $occur{$b} <=> $occur{$a} } keys %occur) {
    say "$occur{$k} $k";
}

edited Mar 25 '16 at 23:28

answered Mar 25 '16 at 23:10

Ævar Arnfjörð Bjarmason

149
1
5

66s running this against the 83million char string file I made :-) Nice ++ – G. Cito Mar 25 '16 at 23:46
1

@G.Cito: I'd like to see your code: I chose `substr` for a reason. I *did* benchmark it and `substr` came out 3 times faster than a more simple regex solution. This technique of doing it in five passes comes out 400 times slower than using `substr`. I'm using a 500KB string. – Borodin Mar 25 '16 at 23:56
*"I bet something like this is a much more viable way to do this"* and *"I'm just offloading that to `|sort|uniq -c` in this implementation, which is probably faster"* It doesn't really help to post wild guesses like that, and you have a strange idea of what makes things faster. Spawning multiple processes and passing the data as text via pipes is really going to struggle compared with an in-memory `uniq` and `sort` within the same process. And `substr` being far less flexible than regexes is likely to make it faster. My test shows your method to be *vastly* slower by two orders of magnitiude – Borodin Mar 26 '16 at 00:03
1

This brings it within the same order of magnitude. On my test dataset your version takes ~2.2s, mine ~3.2s: `perl -wE 'my $str = <>; my %occur; for my $pos (0..4) { my $cp = $str; substr($cp, 0, $pos) = "" if $pos; $cp =~ s/(.{5})/$1\n/g; ++$occur{$_} for split /\n/, $cp } my @kmers = sort { $occur{$b} <=> $occur{$a} } keys %occur; print "$_ $occur{$_}\n" for @kmers[0..4];' – Ævar Arnfjörð Bjarmason Mar 26 '16 at 00:25
@Borodin I used a very slightly modified version of what @Ævar Arnfjörð Bjarmason posted earlier: `/usr/bin/time perl -wE 'chomp( my $str = <> ) ; my %occur; for my $pos (0..4) { $str =~ s/^.// if $pos; no warnings "void"; $str =~ s/(.{5})/++$occur{$1}/gero }; for my $k (sort { $occur{$b} <=> $occur{$a} } keys %occur) { state $i=0; $i++; last if $i == 5 ; say "$occur{$k} $k" }' xaa` – G. Cito Mar 26 '16 at 00:29
Sorry, stupid SO comment limits, generally faster. In this case piping to `sort|uniq` was also slower, but in many cases it isn't, because if you're CPU limited and your perl process is at a 100% on one core it makes more sense to offload processing to other cores. – Ævar Arnfjörð Bjarmason Mar 26 '16 at 00:30
And generally speaking regexes aren't slower than the likes of substr because they're more general, on the contrary they're usually faster because they can avoid perl's runloop. – Ævar Arnfjörð Bjarmason Mar 26 '16 at 00:34
1

You really shouldn't throw these notions around as facts without a benchmark to support them. Regexes are *not* faster than `substr` in a loop; you've just done a test yourself and found that a basic `for` loop plus `substr` is faster (I measured twice as fast) than your own global regex match. Since you've already optimised your code by doing five passes I doubt if you can make it go any faster, and I can't imagine what circumstances could be more advantageous to a regex solution. – Borodin Mar 26 '16 at 01:14
Likewise, it's a strange idea that serialising and deserialising the data several times to pass it through multiple pipes to new child processes is faster because it gains access to more CPU cores. Again, it is hard to imagine why Perl hashes are faster than a shell command chain in just this if they are slower in general: that really doesn't make sense. If you can come up with a demonstration of any of these assertions then please write them up put them here. – Borodin Mar 26 '16 at 01:14
1

I think you're ignoring the fact that the OP is asking how to do this for input files of 10-1000GB. – ThisSuitIsBlackNot Mar 27 '16 at 04:44

Count subsequences in hundreds of GB of data

4 Answers4

output

Linked