I have trouble believing that profile result. In this code
    16 for (int x = 1; x < w + 1; x++, pg++, ps += n_bins, psu += n_bins) {
    17     s += *pg;
    18     *ps = *psu + s;
    19 }
the profiler says the lion's share of time is on line 18, very little on line 17, and next to nothing on line 16.
Yet the loop is also doing a comparison, two increments, and three adds on every iteration.
Cache misses might explain it, but there's no harm in double-checking, which I do with this technique.
Regardless, the loop could be unrolled, for example:
    int x = w;
    /* main loop: four elements per iteration */
    while (x >= 4) {
        s += pg[0];
        ps[n_bins*0] = psu[n_bins*0] + s;
        s += pg[1];
        ps[n_bins*1] = psu[n_bins*1] + s;
        s += pg[2];
        ps[n_bins*2] = psu[n_bins*2] + s;
        s += pg[3];
        ps[n_bins*3] = psu[n_bins*3] + s;
        x -= 4;
        pg += 4;
        ps += n_bins*4;
        psu += n_bins*4;
    }
    /* cleanup loop: the remaining 0 to 3 elements */
    for (; --x >= 0;) {
        s += *pg;
        *ps = *psu + s;
        pg++;
        ps += n_bins;
        psu += n_bins;
    }
If n_bins happens to be a constant, this could enable the compiler to do some more optimizing of the code in the while loop.
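Before timing anything, it is worth confirming that the unrolled loop writes exactly what the simple loop writes. Here is a minimal sketch of such a check. The function names (`row_simple`, `row_unrolled`, `unrolled_matches_simple`), the `int` element type, the sizes `N_BINS` and `MAX_W`, and the test data are all assumptions made for illustration; they are not taken from the original code. `N_BINS` is a compile-time constant here, so the compiler can fold the `N_BINS*k` offsets.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sizes, fixed at compile time for this sketch. */
enum { N_BINS = 8, MAX_W = 64 };

/* Simple loop: running sum of pg, added to the row above (psu). */
static void row_simple(int w, const int *pg, int *ps, const int *psu) {
    int s = 0;
    for (int x = 0; x < w; x++, pg++, ps += N_BINS, psu += N_BINS) {
        s += *pg;
        *ps = *psu + s;
    }
}

/* Same computation, unrolled by four, with a cleanup loop for 0 to 3 leftovers. */
static void row_unrolled(int w, const int *pg, int *ps, const int *psu) {
    int s = 0;
    int x = w;
    while (x >= 4) {
        s += pg[0]; ps[N_BINS * 0] = psu[N_BINS * 0] + s;
        s += pg[1]; ps[N_BINS * 1] = psu[N_BINS * 1] + s;
        s += pg[2]; ps[N_BINS * 2] = psu[N_BINS * 2] + s;
        s += pg[3]; ps[N_BINS * 3] = psu[N_BINS * 3] + s;
        x -= 4;
        pg += 4;
        ps += N_BINS * 4;
        psu += N_BINS * 4;
    }
    for (; --x >= 0;) {
        s += *pg;
        *ps = *psu + s;
        pg++;
        ps += N_BINS;
        psu += N_BINS;
    }
}

/* Returns 1 if both versions produce identical output for width w. */
int unrolled_matches_simple(int w) {
    assert(w >= 0 && w <= MAX_W);
    static int pg[MAX_W], psu[MAX_W * N_BINS];
    static int out_a[MAX_W * N_BINS], out_b[MAX_W * N_BINS];

    /* Arbitrary but deterministic test data. */
    for (int i = 0; i < MAX_W; i++) pg[i] = i % 7;
    for (int i = 0; i < MAX_W * N_BINS; i++) psu[i] = 3 * i + 1;
    memset(out_a, 0, sizeof out_a);
    memset(out_b, 0, sizeof out_b);

    row_simple(w, pg, out_a, psu);
    row_unrolled(w, pg, out_b, psu);
    return memcmp(out_a, out_b, sizeof out_a) == 0;
}
```

Calling `unrolled_matches_simple` from a test `main` with widths that are and are not multiples of four (for example 0, 3, 10, and 64) exercises both the unrolled loop and the cleanup loop.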