
I am trying to improve the speed of a computational (biological) model written in C++ (the previous version is on my GitHub: Prokaryotes). The most time-consuming function is the one where I calculate binding affinities between transcription factors and binding sites on a single genome.

Background: In my model, binding affinity is given by the Hamming distance between the binding domain of a transcription factor (a 20-bool array) and the sequence of a binding site (also a 20-bool array). For a single genome, I need to calculate the affinities between all active transcription factors (typically 5-10) and all binding sites (typically 10-50). I do this every timestep for more than 10,000 cells in the population to update their gene expression states. Having to calculate up to half a million comparisons of 20-bool arrays per timestep means that typical experiments (2M--10M timesteps) take several months.

For the previous version of the model (link above), genomes remained fairly small, so I could calculate binding affinities once for every cell (at birth) and store and re-use these numbers during the cell's lifetime. However, in the latest version, genomes expand considerably and multiple genomes reside within the same cell. Thus, storing the affinities of all transcription factor--binding site pairs in a cell becomes impractical.

In the current implementation I defined an inline function belonging to the Bead class (which is a base class for the transcription factor class "Regulator" and the binding site class "Bsite"). It is written directly in the header file Bead.hh:

inline int Bead::BindingAffinity(const bool* sequenceA, const bool* sequenceB, int seqlen) const
{
    int affinity = 0;
    for (int i=0; i<seqlen; i++)
    {
        affinity += (int)(sequenceA[i]^sequenceB[i]);    //XOR is 1 exactly where the sequences differ
    }
    return affinity;
}

The above function accepts two pointers to boolean arrays (sequenceA and sequenceB) and an integer specifying their length (seqlen). Using a simple for-loop, I count at how many positions the arrays differ (sequenceA[i]^sequenceB[i]), summing the result into the variable affinity.

Given a binding site (bsite) on the genome, we can then iterate through the genome and, for every transcription factor (reg), calculate its affinity to this particular binding site like so (a simplified sketch of the surrounding loop follows below):

affinity = (double)reg->BindingAffinity(bsite->sequence, reg->sequence, regulator_length);
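
For context, the surrounding loop looks roughly like this (a simplified sketch; active_regulators is a placeholder for however the active transcription factors are stored, not my exact code):

//Simplified call-site sketch (cf. Genome.cc): score one binding site against
//every active transcription factor in the cell
for (Regulator* reg : active_regulators)    //hypothetical container of active TFs
{
    affinity = (double)reg->BindingAffinity(bsite->sequence, reg->sequence, regulator_length);
    //...use affinity to update the expression state of the downstream gene...
}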

So, this is about as streamlined as I have managed to make it. Since I don't have a programming background, I wonder whether there are better ways to write the above function or to structure the code (e.g. should BindingAffinity be a member function of the base Bead class at all?). Suggestions are greatly appreciated.

  • You would have to benchmark this, but I expect that these calculations are faster when you store each boolean sequence of length 20 in a 32-bit unsigned integer and then call `std::popcount(seq1^seq2);`. If you can avoid storing them as boolean arrays at all, so that no array-to-int conversion is necessary, this could be much faster and save some RAM (depending on bool storage). – eike Dec 21 '21 at 14:18
  • What is typical value of seqlen? – Nanev Dec 21 '21 at 14:33
  • From the text, I would assume 20: "sequence of a binding site (also a 20-bool array)" – eike Dec 21 '21 at 14:48
  • You could try `std::bitset<20>` instead of the arrays, apply `^=` on the two bitsets, and then issue a `bitset<20>::count()` on that value to get the number of bits that are set. – PaulMcKenzie Dec 21 '21 at 14:52
  • [Sort of like this](http://coliru.stacked-crooked.com/a/249bd416b854d1f2). But again, you should profile the code. Note that the first parameter is passed by value, since `^=` is destructive for the bitset on the left-hand side, so a copy is made. But since it's only 20 bits, the copy should be negligible. – PaulMcKenzie Dec 21 '21 at 15:03
  • @PaulMcKenzie: Thanks, your idea looks very promising (playing around in a C++ shell)! I will soon try implementing it in my model. – SlamDunkTheFunk Dec 22 '21 at 09:48
  • @eike: Thanks for the suggestion; I thought about this as well, but one difficulty is that I want these bitstrings to mutate in my model. With a small rate, single bits flip in newborn cells. I wouldn't know how to translate this to altering the integer value. Since bitstring mutations are rare (compared to bitstring comparisons for updating gene expression), it could still be a valuable route to convert the int to a bool array and back just for the mutation. – SlamDunkTheFunk Dec 22 '21 at 10:00
  • @SamvonderDunk How are you mutating these sequences? By flipping n random bits of a sequence or by flipping each bit of a sequence with a given probability? In general, you can flip bit `n` of sequence `seq` by calculating `seq = seq ^ (1 << n)`. – eike Dec 22 '21 at 11:35
  • With bitset you are at the mercy of the standard library implementation when it comes to space and performance cost. You should definitely profile and if your software is also built by third parties, you should profile using a variety of compilers as well. – eike Dec 22 '21 at 11:42
  • @eike: I am indeed flipping each bit with a given probability. So I guess the only thing for me to find out is whether unnecessarily comparing the 12 leading bits of 32-bit integers compromises the speed relative to comparing length-20 bitsets for my particular compiler. I will check it out, thanks! – SlamDunkTheFunk Dec 22 '21 at 13:19
  • @SamvonderDunk Bit operations on two 32-bit integers are single ops on pretty much every current CPU architecture, so you can't really be faster than that. Bitsets might be equally fast (because the implementation may do the same thing), but I wouldn't expect them to be faster. The code using bitsets may be nicer to read/write for someone not too familiar with bitwise operations, though. – eike Dec 22 '21 at 13:27

1 Answer


Thanks to @PaulMcKenzie and @eike for your suggestions. I tested both ideas against my previous implementation; the results are below. In short, both suggestions work very well.

My previous implementation yielded an average runtime of 5m40s +/- 7s (n=3) for 1000 timesteps of the model. Profiling with gprof showed that the function BindingAffinity() took 24.3% of total runtime. [See Question for the code.]

The bitset implementation yielded an average runtime of 5m11s +/- 6s (n=3), corresponding to a ~9% speed increase. Only 3.5% of total runtime was spent in BindingAffinity().

//Function definition in Bead.hh
inline int Bead::BindingAffinity(std::bitset<regulator_length> sequenceA, const std::bitset<regulator_length>& sequenceB) const
{
    //sequenceA is passed by value, so ^= only modifies the local copy
    return (int)(sequenceA ^= sequenceB).count();
}

//Function call in Genome.cc
affinity = (double)reg->BindingAffinity(bsite->sequence, reg->sequence);

The main downside of the bitset implementation is that, unlike with boolean arrays (my previous implementation), I have to specify the length of the bitset that goes into the function. I occasionally compare bitsets of different lengths, so for those I now have to write separate functions (a template would have to be defined entirely in the header, since template definitions cannot be split off into a separate source file; see https://www.cplusplus.com/doc/oldtutorial/templates/).
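
That said, a single header-only function template would cover the different lengths, as long as its definition stays in Bead.hh. This is just a sketch of the idea (shown as a free function for simplicity), not what I ended up using:

#include <bitset>     //for std::bitset
#include <cstddef>    //for std::size_t

//Must live in the header so that every translation unit that uses it can
//instantiate it for the bitset length it needs
template<std::size_t seqlen>
inline int BindingAffinity(std::bitset<seqlen> sequenceA, const std::bitset<seqlen>& sequenceB)
{
    //sequenceA is a copy, so ^= leaves the caller's bitset untouched
    return (int)(sequenceA ^= sequenceB).count();
}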

For the integer implementation I tried two alternatives to the std::popcount(seq1^seq2) suggested by @eike, since I am working with an older C++ standard (std::popcount requires C++20).

Alternative #1:

inline int Bead::BindingAffinity(int sequenceA, int sequenceB) const
{
    //construct a bitset from the XOR of the two sequences and count the set bits
    std::bitset<32> diff (sequenceA^sequenceB);
    return (int)diff.count();
}

Alternative #2:

inline int Bead::BindingAffinity(int sequenceA, int sequenceB) const
{
    int i = sequenceA^sequenceB;

    //SWAR algorithm, copied from https://stackoverflow.com/questions/109023/how-to-count-the-number-of-set-bits-in-a-32-bit-integer
    i = i - ((i >> 1) & 0x55555555);        // add pairs of bits
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);  // quads
    i = (i + (i >> 4)) & 0x0F0F0F0F;        // groups of 8
    return (i * 0x01010101) >> 24;          // horizontal sum of bytes
}

These yielded average runtimes of 5m06s +/- 6s (n=3) and 5m06s +/- 3s (n=3), respectively, corresponding to a ~10% speed increase compared to my previous implementation. I only profiled Alternative #2, which showed that only 2.2% of total runtime was spent in BindingAffinity(). The downside of using integers for bitstrings is that I have to be very careful whenever I change any of the code. Single-bit mutations are definitely possible as mentioned by @eike (see the sketch below), but everything is just a little bit trickier.
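
For reference, per-bit mutation on an integer-encoded sequence can be done roughly like this, following @eike's comment. Note that uniform() is only a stand-in for whatever random number source the model already uses:

#include <cstdlib>

//Placeholder RNG in [0,1) for the sake of the sketch; the model would use its own
static double uniform()
{
    return (double)std::rand() / ((double)RAND_MAX + 1.0);
}

//Flip each of the 20 used bits independently with probability mu (per-bit mutation)
unsigned int MutateSequence(unsigned int sequence, double mu)
{
    for (int n=0; n<20; n++)
    {
        if (uniform() < mu)
        {
            sequence ^= (1u << n);    //flip bit n, as suggested by @eike
        }
    }
    return sequence;
}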

Conclusion: Both the bitset and integer implementations for comparing bitstrings achieve impressive speed improvements, so much so that BindingAffinity() is no longer the bottleneck in my code.