
Similar to the question "unix - count occurrences of character per line/field", but for every character in every position on the line.

Given a file of ~1e7 lines of ~500 characters each, I want a two-dimensional summary structure like $summary{char}[pos] = count, where char is any character that appears (e.g. 'a', 'b', 'c', '0', '1', '2') and pos runs 0..499, recording how many times each character occurred at each position of the line. Either order of dimensions is fine.

My first method did ++$summary{$char}[$pos] while reading, but since many lines are identical it was much faster to count duplicate lines first and then add $summary{$char}[$pos] += $n per distinct line.

Are there more idiomatic or faster ways than the following C-like 2d loop?

#!perl 
my ( %summary, %counthash ); # perl 5.8.9

sub method1 {
    print "method1\n";
    while (<DATA>) {
        my @c = split( // , $_ );
        ++$summary{ $c[$_] }[$_] foreach ( 0 .. $#c );
    }    # wend
} ## end sub method1

sub method2 {
    print "method2\n";
    ++$counthash{$_} while (<DATA>);    # slurpsum the whole file

    foreach my $str ( keys %counthash ) {  
        my $n = $counthash{$str};
        my @c = split(//, $str);
        $summary{ $c[$_] }[$_] += $n foreach ( 0 .. $#c );
    }    #rof  my $str
} ## end sub method2

# MAINLINE
if (rand() > 0.5) { &method1 } else { &method2 }
print "char $_ : @{$summary{$_}} \n" foreach ( 'a', 'b' );
# both methods have this output summary
# char a : 3 3 2 2 3 
# char b : 2 2 3 3 2 
__DATA__
aaaaa
bbbbb
aabba
bbbbb
aaaaa
jgraber
    It's quite hard to visualise what you're looking for with that sample data - I assume your scenario isn't quite as trivial as a line full of repeated characters? Also: `use strict; use warnings;` is a really good idea. – Sobrique Dec 07 '15 at 18:21
  • The only inefficiency/non-idiomaticity(?) I see is that you're counting all the line-termination characters (newlines and/or CRs) as well. (Perl includes them in `$_` unless you do something.) Stick in a `chomp;` after each `<DATA>` read. – Jeff Y Dec 07 '15 at 18:57
  • @JeffY: *unidiomaticity*, I believe – Borodin Dec 07 '15 at 22:12
  • Are these DNA sequences? – Borodin Dec 07 '15 at 22:16
  • The real data is TDL, a form of VHDL vector using the characters HLCM01Z, and I'm looking for which pins/columns are used vs static. I have `use warnings; use strict;` in the real program, but I neglected to include them in the sample program for posting. @Sobrique @JeffY @Borodin – jgraber Dec 08 '15 at 15:58
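
For the pins/columns use case above, a post-processing pass over %summary can report which columns are static. A minimal sketch, assuming %summary is keyed by literal characters as in the question's code (not ordinals), that lines were chomp-ed (per the comment above) so "\n" never appears as a key, and that the line length is known; the sub name report_static_columns is purely illustrative:

use strict;
use warnings;

# Sketch: given the %summary built while counting (char => arrayref of
# per-position counts), report which columns are static, i.e. only one
# distinct character ever appears there.  Name and $line_len argument
# are illustrative assumptions, not from the original code.
sub report_static_columns {
    my ( $summary, $line_len ) = @_;
    for my $pos ( 0 .. $line_len - 1 ) {
        # every character with a nonzero count at this position
        my @seen = grep { ( $summary->{$_}[$pos] || 0 ) > 0 } keys %$summary;
        if ( @seen == 1 ) {
            print "column $pos is static: always '$seen[0]'\n";
        }
        else {
            print "column $pos varies: ", join( ' ', sort @seen ), "\n";
        }
    }
}

# usage (line length 500 as in the question):
# report_static_columns( \%summary, 500 );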

1 Answer


Depending on how your data is formed, method2 might be a bit faster or slower than method1.

But a big difference would be to use unpack instead of split.

use strict;
use warnings;
my ( %summary, %counthash ); # perl 5.8.9

sub method1 {
    print "method1\n";
    my @l= <DATA>;
    for my $t (1..1000000) {    # repeat 1e6 times for timing
        foreach (@l) {
            my @c = split( // , $_ );
            ++$summary{ $c[$_] }[$_] foreach ( 0 .. $#c );
        }    
    }    # wend
} ## end sub method1

sub method2 {
    print "method2\n";
    ++$counthash{$_} while (<DATA>);    # slurpsum the whole file
    for my $t (1..1000000) {    # repeat 1e6 times for timing
        foreach my $str ( keys %counthash ) {  
            my $n = $counthash{$str};
            my $i = 0;
            $summary{ $_ }[$i++] += $n foreach ( unpack("c*",$str) );    # unpack "c*" yields ordinals, so keys are 97, 98, ...
        }    
    }
} ## end sub method2

# MAINLINE
#method1();    # note: method1 keys %summary by character, method2 by ordinal
method2();
print "char $_ : ". join (" ", @{$summary{ord($_)}}). " \n"    # ord() because unpack "c*" left ordinal keys
    foreach ( 'a', 'b' );
# both methods have this output summary
# char a : 3 3 2 2 3 
# char b : 2 2 3 3 2 
__DATA__
aaaaa
bbbbb
aabba
bbbbb
aaaaa

It runs much faster (6 instead of 7.x seconds on my pc).
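
For a more systematic comparison than the hand-rolled million-iteration loops, the core Benchmark module can race the two inner loops directly. A minimal sketch, where the sub names split_loop and unpack_loop and the tiny @lines sample are illustrative only:

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Compare the split-based loop against the unpack-based loop on the
# same in-memory lines.  A fresh %summary is built inside each sub so
# the per-iteration work is comparable.
my @lines = ( "aaaaa\n", "bbbbb\n", "aabba\n", "bbbbb\n", "aaaaa\n" );

cmpthese( -3, {    # run each candidate for at least 3 CPU seconds
    split_loop => sub {
        my %summary;
        for my $str (@lines) {
            my @c = split //, $str;
            ++$summary{ $c[$_] }[$_] for 0 .. $#c;
        }
    },
    unpack_loop => sub {
        my %summary;    # note: keyed by ordinal values here
        for my $str (@lines) {
            my $i = 0;
            ++$summary{$_}[ $i++ ] for unpack 'c*', $str;
        }
    },
} );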

Georg Mavridis
  • Did you test that? `unpack("c*",$str)` generates the wrong summary keys of 98 and 97 rather than 'a' and 'b'; 'a*' does not work; THIS WORKS: `$summary{ $_ }[$i++] += $n foreach ( unpack('a' x length($str),$str) );` THIS ALSO WORKS: `$summary{ chr($_) }[$i++] += $n foreach ( unpack('c*',$str) );` – jgraber Dec 08 '15 at 17:47
  • `$summary{ substr($str,$_,1) }[$_] += $n foreach ( 0..(length($str)-1) );` # IS EQUALLY AS FAST – jgraber Dec 08 '15 at 18:53
  • @jgrabber yes I did, and it worked. unpack just returns the inverse of chr, so in my code I print summary{ord($_)}, as you might have noticed... But the solution with length and substr is even faster. The original code (executed a million times) takes 7.177 secs on my pc, the solution with unpack takes 5.879 secs, and the solution with length and substr takes only 4.286 secs. – Georg Mavridis Dec 09 '15 at 09:31
  • > so in my code I print summary{ord($_)}, as you might have noticed. – jgraber Dec 09 '15 at 17:45
  • comment timeout: @Georg (just 1 b in jgraber) Not until you pointed it out did I notice the ord in the print. In a similar application, to get data for just user-chosen columns, the fastest approach I have seen is building the code for a bunch of substr() calls into a string, then evalling it to get a compiled subroutine, then calling that. Re perl idioms, [link](http://stackoverflow.com/a/1868490/5650997) demonstrates use of map for a similar loop. Maybe not until perl6 would there be a way to add to 2d slices; @summary{ slice }[0..m] ^+= $n x (slice * m); – jgraber Dec 09 '15 at 17:53
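
The "build substr() calls into a string, then eval it into a compiled subroutine" approach mentioned in that last comment is not shown anywhere above. A rough sketch, where @want_cols, the generated sub, and the calling convention are all illustrative assumptions (and %summary is assumed keyed by literal characters):

use strict;
use warnings;

# Sketch: generate one "+= $n" statement per wanted column, then eval
# the whole thing once into a compiled code ref.  The column list and
# variable names are made up for illustration, not from the post.
my @want_cols = ( 0, 3, 17, 42 );    # columns of interest

my $code = join "\n",
    'sub {',
    '    my ( $summary, $str, $n ) = @_;',
    ( map { sprintf '    $summary->{ substr($str, %d, 1) }[%d] += $n;', $_, $_ }
        @want_cols ),
    '}';

my $counter = eval $code
    or die "generated code failed to compile: $@";

# usage: for each distinct line $str seen $n times
#   $counter->( \%summary, $str, $n );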