Modulo optimization

Question

I have a C program which does some extensive swapping operations on a large array. It has a modulo operation in its tight loop. In fact there is an integer in the range [-N|N[ with N a power of 2 and it should be wrapped to [0,N[.

Example with N=4: -4 => 0, -3 => 1, -2 => 2, -1 => 3, 0 => 0, ..., 3 => 3

At first I tried the version 1 below but was surprised that version 2 is actually notably faster even though it has a conditional expression.

Can you explain why version 2 is faster than version 1 for this special case?

Version 1:

#define N (1<<(3*5))

inline int modBitAnd(int x)
{
    return (x & (N-1));
}

Runtime: 17.1 seconds (for the whole program)

Version 2:

inline int modNeg1(int x)
{
    return (x < 0 ? x + N : x);
}

Runtime: 14.6 seconds (for the whole program)

Program is compiled on GCC 4.8.2. with -std=c99 -O3.

Edit:

Here is the main loop in my program:

int en(uint16_t* p, uint16_t i, uint16_t v)
{
    uint16_t n1 = p[modNeg1((int)i - 1)];
    uint16_t n2 = p[modBitAnd((int)i + 1)];
    uint16_t n3 = p[modNeg1((int)i - C_WIDTH)];
    uint16_t n4 = p[modBitAnd((int)i + C_WIDTH)];
    return d(n1,v) + d(n2,v) + d(n3,v) + d(n4,v);
}

void arrange(uint16_t* p)
{
    for(size_t i=0; i<10000000; i++) {
        uint16_t ia = random(); // random integer [0|2^15[
        uint16_t va = p[ia];
        uint16_t ib = random(); // random integer [0|2^15[
        uint16_t vb = p[ib];
        if(en(p,ia,vb) + en(p,ib,va) < en(p,ia,va) + en(p,ib,vb)) {
            p[ia] = vb;
            p[ib] = va;
        }
    }
}

int d(uint16_t a, uint16_t b) is a distance function e.g. abs((int)a-(int)b).

This is how p is initialized:

uint16_t* p = malloc(sizeof(uint16_t)*N);
for(unsigned i=0; i<N; i++) *p++ = i;

First I used modBitAnd everywhere, but found out that the modNeg1 is acutally faster for the two cases where it can be used.

Have you tried looking at the assembly code produced by the compiler? — Medinoc, Jul 16 '14 at 10:13
the 2nd version doesn't work correctly unless you pass to it a remainder result because there's no modulo happens in it — phuclv, Jul 16 '14 at 10:20
Which platform (32 or 64 bits)? Also, please [post a complete code example](http://meta.stackoverflow.com/a/258849/509868) or a disassembly listing. — anatolyg, Jul 16 '14 at 10:48
@LưuVĩnhPhúc: I think it works fine. Please read the specifications carefully. — Danvil, Jul 16 '14 at 11:41
I note that `random()` should generate the same sequence each time the program is run... unless you have arranged otherwise. Clearly it is important to run the two versions with the same sequence. If you want `ia` to be set to 0..32767 I suggest masking the result of `random()`... as it stands `ia` is set to 0..65535... which `modBitAnd()` will gloss over, but `modNeg1()` will not ! It's not clear to me why `uint16_t va = p[ia];` doesn't crash and burn... unless the area pointed to by `p` is larger than it apparently needs to be from the code fragments shown. — , Jul 16 '14 at 13:13
@gmch: The task I am solving asserts a fixed length of N for `p` and that all values in `p` are in [0,N[. `random` is a mersenne twister which returns an integer in [0,N[ and is also seeded properly at the beginning of the program. — Danvil, Jul 16 '14 at 13:56
for x = 32769 the first will return 1 and the second returns 65537. It's only correct when the return type of the function is int16_t. Also, the cast to int in the distance function is completely unneccessary because any type smaller than int must always be promoted to int in an expression — phuclv, Jul 16 '14 at 14:00
Condition doesn't neccessarily mean that statement is slower because the compiler may use conditional move so there may be no branch — phuclv, Jul 16 '14 at 14:02

score 0 · Answer 1 · edited May 23 '17 at 12:05

0

First take a few stackshots to find out where the time is actually going. Your mod functions will grab some fraction of the samples, but you've also got two calls to random, plus a fair amount of array indexing. Also, it looks like you've got four calls to en with some arguments that are the same, so maybe your modularity is leading to repeat calls to the mod functions.

edited May 23 '17 at 12:05

Community

1
1

answered Jul 16 '14 at 13:23

Mike Dunlavey

40,059
14
91
135

Thank you for your comment. There is definitely a lot of time spent elsewhere. But what I observed is by choosing `modNeg1` instead of `modBitAnd` at the two locations in function `en` the program runs 20% faster. That is what I am wondering about. – Danvil Jul 16 '14 at 13:57
@Danvil: I generally do `mod` with unsigned integers or pure bit masks to get around issues with negative numbers. Regardless, 20% is not a huge speedup, and you can probably do better than that. – Mike Dunlavey Jul 16 '14 at 14:02

Modulo optimization

1 Answers1