2

I want to know what is the fastest way to count the number of set bits (1's) in binary file in Perl I need it to fast because I'm reading 10's of files, each one with ~50 million bits.

the way I'm doing it right now is too slow, and the it takes couple of hours to run over 10-15 files.

this is how I do it right now (I know it's slow an inefficient, but in the past the files were much smaller and this method was good enough):

#count number of 1's in binary vector
sub get_DET_fault_count {
    my $bin_vec = shift;

    my $tmp_vec = generate_tmp_path("bin_vec");
    io($tmp_vec)->println( unpack( "B*", $bin_vec ) );
    my $fault_count = `grep -o -E '1' $tmp_vec | wc -l`;

    chomp $fault_count;
    `rm $tmp_vec`;
    return $fault_count;
}
Soft_eng_1989
  • 43
  • 1
  • 3
  • 1
    I'm too curious: What do you need that information for? Is it kind of homework? After all, I could not have thought of a slower version to implement this:-p – Daniel Böhmer Jan 20 '15 at 13:45
  • 2
    You could precompute a Hash that contains a mapping from the value of the byte (in decimal) to the number of set bits. (e.g. 5 => 3) Then with each byte you read you merely need to perform a table lookup. – Hunter McMillen Jan 20 '15 at 13:47
  • no it's not homework. and I know it's the slowest version, but in the past, these files were 20K-30K. so I didn't waste time on searching for something faster :P. now I need it. – Soft_eng_1989 Jan 20 '15 at 13:49
  • 1
    So, does this mean it is a secret what this for? I am as well extremely curious. ^^ It seems like such a basic task "Count the ones". Yet i can not imagine a situation where i would need this except for maybe one or two bytes where one would count some sort of flags but for some reason not care which ones where set, just the number... – DeVadder Jan 20 '15 at 14:56

2 Answers2

8

There are two ways I can think of: 1) use unpack as you are already doing, but don't waste cycles doing any IO. 2) use a lookup table with precomputed values for how many bits are in a given byte

1) The trick here is the '%' directive to unpack which tells unpack to perform a checksum across the result, which sums all of the 0s and 1s in the case of binary data

use strict;
use warnings;

my $filename = $ARGV[0];

open(my $fh, '<', $filename) or die "$!";
binmode $fh;


my $count = 0;
my $word  = 0;

while ( read $fh, $word, 4 ) {
   $count += unpack '%32B*', $word;
}

print "filename contains $count set bits\n";
__END__
7733485

2) The values from 0 - 255 only have a certain number of set bits, which will never change, so you can precompute an array to hold all of them. You waste a little memory -- around 4k or 8k depending on the build -- to prevent any computation except for lookups.

use strict;
use warnings;

my $filename = $ARGV[0];

open(my $fh, '<', $filename) or die "$!";
binmode $fh;

my @bitcounts = (
   0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 
   3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 
   3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 
   2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 
   3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 
   5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 
   2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 
   4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
   3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 2, 3, 3, 4, 3, 4, 
   4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 
   5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 4, 5, 
   5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
);

my $count = 0;
my $byte  = 0;
while ( read $fh, $byte, 1 ) {
   $count += $bitcounts[ord($byte)];
}

print "filename contains $count set bits\n";
__END__
7733485

Both of these approaches give me 7733485 for a sample JPEG I processed.

ikegami
  • 367,544
  • 15
  • 269
  • 518
Hunter McMillen
  • 59,865
  • 24
  • 119
  • 170
  • There's a `count1` typo in your code. Also why initialize `$word`/`$byte` with `0`? Did you compare the speed? – Daniel Böhmer Jan 20 '15 at 14:29
  • Thanks, I wrote both solutions in the same file and used different counters, forgot to update that one. – Hunter McMillen Jan 20 '15 at 14:30
  • I tested them here are the result: --time ./binary_count_1.pl dummy_85Mbit.bin -- --Det count 8408334-- --1.112u 0.004s 0:01.11 100.0% 0+0k 0+0io 0pf+0w-- --time ./binary_count_2.pl dummy_85Mbit.bin-- --Det count 8408334-- --3.104u 0.008s 0:03.11 99.6% 0+0k 0+0io 0pf+0w-- which means, the first one is faster. – Soft_eng_1989 Jan 20 '15 at 14:36
  • Yeah, probably because it reads 4 bytes at a time instead of 1. You could modify the second one to read 4 bytes at a time then pass each byte to the lookup table separately and it should speed up quite a lot – Hunter McMillen Jan 20 '15 at 14:46
  • Nice answer. I really like the trick from 1) but sadly i still cannot imagine a situation in which i could ever use my new-found knowledge. ^^ – DeVadder Jan 20 '15 at 14:53
  • imagine that you have a test that checks number of systems in particular order, and print out the status of each system (pass/fail). let's say you ran 100 different tests on these system in the same order, and now, you want to count how many "pass" you get from all these test (two tests may give you the same result for a specific system, but we don't want to count it twice). so you can convert these results to binary, and run binary OR between them and save a lot of time.. finally , you want to count the number of the 1's. the same if you want to know how many extra "pass" you have foreach tst – Soft_eng_1989 Jan 20 '15 at 15:47
  • @Soft_eng_1989 Well, okay, that implies whatever you people create is quite thoroughly tested, with each file having 50 million bits set. ^^ My inability to come up with that myself probably says more about me than it does about you though. – DeVadder Jan 20 '15 at 16:02
  • 1
    Possibly optimized second version: `my $count = 0; while ( sysread $fh, my $buf, 64*1024) { $count += $bitcounts[$_] for unpack 'C*', $buf; } ` – ikegami Jan 20 '15 at 16:36
2

You could do a loop that keeps shifting the bytes right (using the >> operator) and check if the lowest bit is set.

Something like this:

do {
   $counter++ if $bin_vec & 1;
   $bin_vec >> 1;
} while ($bin_vec > 0);
neuhaus
  • 3,886
  • 1
  • 10
  • 27