I am trying to improve the speed of a computational (biological) model written in C++ (previous version is on my github: Prokaryotes). The most time-consuming function is where I calculate binding affinities between transcription factors and binding sites on a single genome.
Background: In my model, binding affinity is given by the Hamming distance between the binding domain of a transcription factor (a 20-bool array) and the sequence of a binding site (also a 20-bool array). For a single genome, I need to calculate the affinities between all active transcription factors (typically 5-10) and all binding sites (typically 10-50). I do this every timestep for more than 10,000 cells in the population to update their gene expression states. Having to calculate up to half a million comparisons of 20-bool arrays to simulate just one timestep of my model means that typical experiments take several months (2M--10M timesteps).
For the previous version of the model (link above) genomes remained fairly small, so I could calculate binding affinities once for every cell (at birth) and store and re-use these numbers during the cell's lifetime. However, in the latest version, genomes expand considerably and multiple genomes reside within the same cell. Thus, storing affinities of all transcript factor--binding site pairs in a cell becomes impractical.
In the current implementation I defined an inline function belonging to the Bead class (which is a base class for transcription factor class "Regulator" and binding site class "Bsite"). It is written directly in the header file Bead.hh:
inline int Bead::BindingAffinity(const bool* sequenceA, const bool* sequenceB, int seqlen) const
{
int affinity = 0;
for (int i=0; i<seqlen; i++)
{
affinity += (int)(sequenceA[i]^sequenceB[i]);
}
return affinity;
}
The above function accepts two pointers to boolean arrays (sequenceA
and sequenceB
), and an integer specifying their length (seqlen
). Using a simple for-loop I then check at how many positions the arrays differ (sequenceA[i]^sequenceB[i]
), summing into the variable affinity
.
Given a binding site (bsite
) on the genome, we can then iterate through the genome and for every transcription factor (reg
) calculate its affinity to this particular binding site like so:
affinity = (double)reg->BindingAffinity(bsite->sequence, reg->sequence);
So, this is how streamlined I managed to make it; since I don't have a programming background, I wonder whether there are better ways to write the above function or to structure the code (i.e. should BindingAffinity be a function of the base Bead class)? Suggestions are greatly appreciated.