I have a C program which does some extensive swapping operations on a large array. It has a modulo operation in its tight loop. In fact there is an integer in the range [-N|N[ with N
a power of 2 and it should be wrapped to [0,N[.
Example with N=4: -4 => 0, -3 => 1, -2 => 2, -1 => 3, 0 => 0, ..., 3 => 3
At first I tried the version 1 below but was surprised that version 2 is actually notably faster even though it has a conditional expression.
Can you explain why version 2 is faster than version 1 for this special case?
Version 1:
#define N (1<<(3*5))
inline int modBitAnd(int x)
{
return (x & (N-1));
}
Runtime: 17.1 seconds (for the whole program)
Version 2:
inline int modNeg1(int x)
{
return (x < 0 ? x + N : x);
}
Runtime: 14.6 seconds (for the whole program)
Program is compiled on GCC 4.8.2. with -std=c99 -O3.
Edit:
Here is the main loop in my program:
int en(uint16_t* p, uint16_t i, uint16_t v)
{
uint16_t n1 = p[modNeg1((int)i - 1)];
uint16_t n2 = p[modBitAnd((int)i + 1)];
uint16_t n3 = p[modNeg1((int)i - C_WIDTH)];
uint16_t n4 = p[modBitAnd((int)i + C_WIDTH)];
return d(n1,v) + d(n2,v) + d(n3,v) + d(n4,v);
}
void arrange(uint16_t* p)
{
for(size_t i=0; i<10000000; i++) {
uint16_t ia = random(); // random integer [0|2^15[
uint16_t va = p[ia];
uint16_t ib = random(); // random integer [0|2^15[
uint16_t vb = p[ib];
if(en(p,ia,vb) + en(p,ib,va) < en(p,ia,va) + en(p,ib,vb)) {
p[ia] = vb;
p[ib] = va;
}
}
}
int d(uint16_t a, uint16_t b)
is a distance function e.g. abs((int)a-(int)b)
.
This is how p
is initialized:
uint16_t* p = malloc(sizeof(uint16_t)*N);
for(unsigned i=0; i<N; i++) *p++ = i;
First I used modBitAnd
everywhere, but found out that the modNeg1
is acutally faster for the two cases where it can be used.