3

we are working on a model checking tool which executes certain search routines several billion times. We have different search routines which are currently selected using preprocessor directives. This is not only very unhandy as we need to recompile every time we make a different choice, but also makes the code hard to read. It's now time to start a new version and we are evaluating whether we can avoid conditional compilation.

Here is a very artificial example that shows the effect:

/* program_define */

#include <stdio.h>
#include <stdlib.h>

#define skip 10

int main(int argc, char** argv) {
    int i, j;
    long result = 0;

    int limit = atoi(argv[1]);

    for (i = 0; i < 10000000; ++i) {
        for (j = 0; j < limit; ++j) {
            if (i + j % skip == 0) {
                continue;
            }
            result  += i + j;
        }
    }

    printf("%lu\n", result);
    return 0;
}

Here, the variable skip is an example for a value that influences the behavior of the program. Unfortunately, we need to recompile every time we want a new value of skip.

Let's look at another version of the program:

/* program_variable */

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    int i, j;
    long result = 0;

    int limit = atoi(argv[1]);
    int skip = atoi(argv[2]);

    for (i = 0; i < 10000000; ++i) {
        for (j = 0; j < limit; ++j) {
            if (i + j % skip == 0) {
                continue;
            }
            result  += i + j;
        }
    }

    printf("%lu\n", result);
    return 0;
}

Here, the value for skip is passed as a command line parameter. This adds great flexibility. However, this program is much slower:

$ time ./program_define 1000 10
50004989999950500

real 0m25.973s
user 0m25.937s
sys  0m0.019s

vs.

$ time ./program_variable 1000 10
50004989999950500

real 0m50.829s
user 0m50.738s
sys  0m0.042s

What we are looking for is an efficient way to pass values into a program (by means of a command line parameter or a file input) that will never change afterward. Is there a way to optimize the code (or tell the compiler to) such that it runs more efficiently?

Any help is greatly appreciated!


Comments:

As Dirk wrote in his comment, it is not about the concrete example. What I meant was a way to replace an if that evaluates a variable that is set once and then never changed (say, a command line option) inside a function that is called literally billions of times by a more efficient construct. We currently use the preprocessor to tailor the desired version of the function. It would be nice if there is a nicer way that does not require recompilation.

Community
  • 1
  • 1
Niels Lohmann
  • 2,054
  • 1
  • 24
  • 49
  • 1
    maybe you shouldn't use pre-processor directives that are constantly changing and reformat your code? – L7ColWinters Jan 18 '12 at 16:23
  • 1
    Interestingly, the performance of the two examples is virtually indistinguishable on my machine (64-bit Ubuntu, `gcc` 4.4.3 with `-O3`). – NPE Jan 18 '12 at 16:24
  • 1
    The difference is almost certainly due to the fact that if `skip` is known at compile time, the compiler can produce better code for `j % skip` using only multiplications and shifts. – Daniel Fischer Jan 18 '12 at 16:26
  • The bad news is that division (and `%` counts as division) often is slower in cases where the optimizer doesn't know the exact value. There are tricks you can play to replace division by a particular value with one more more multiplications/adds/shifts, if those work out faster. In principle, you don't necessarily have to recompile the entire program to re-run the optimizer for a new value, if you had a suitable intermediate form just before optimization. You could try using a compiler with link-time optimization, see whether it can optimize when you re-link against a new value. – Steve Jessop Jan 18 '12 at 16:51
  • 1
    Oh, and this may sound odd, but you could try it in C# or Java. Since they JIT-compile, they optimize later than a typical C++ toolchain, so you might get lucky. The trick is to work out how to supply the value after compilation to byte code, but before the code is optimized. – Steve Jessop Jan 18 '12 at 16:54
  • If Skip is a power of two *and* known at compile time (in java or .net jit-compile time), the modulo can be optimized into a simple bitwise operation which is magnitudes faster.. – codymanix Jan 19 '12 at 08:23

11 Answers11

5

You can take a look at libdivide which works to do fast division when the divisor isn't known until runtime: (libdivide is an open source library for optimizing integer division).

If you calculate a % b using a - b * (a / b) (but with libdivide) you might find that it's faster.

Edgar
  • 6,022
  • 8
  • 33
  • 66
Sophie Alpert
  • 139,698
  • 36
  • 220
  • 238
2

I ran your program_variable code on my system to get a baseline of performance:

$ gcc -Wall test1.c
$ time ./a.out 1000 10
50004989999950500

real    0m55.531s
user    0m55.484s
sys     0m0.033s

If I compile test1.c with -O3, then I get:

$ time ./a.out 1000 10
50004989999950500

real    0m54.305s
user    0m54.246s
sys     0m0.030s

In a third test, I manually set the values of limit and skip:

int limit = 1000, skip = 10;

I then re-run the test:

$ gcc -Wall test2.c
$ time ./a.out
50004989999950500

real    0m54.312s
user    0m54.282s
sys     0m0.019s

Taking out the atoi() calls doesn't make much of a difference. But if I compile with -O3 optimizations turned on, then I get a speed bump:

$ gcc -Wall -O3 test2.c
$ time ./a.out
50004989999950500

real    0m26.756s
user    0m26.724s
sys     0m0.020s

Adding a #define macro for an ersatz atoi() function helped a little, but didn't do much:

#define QSaToi(iLen, zString, iOut) {int j = 1; iOut = 0; \
    for (int i = iLen - 1; i >= 0; --i) \
        { iOut += ((zString[i] - 48) * j); \
        j = j*10;}}

...
int limit, skip;
QSaToi(4, argv[1], limit);
QSaToi(2, argv[2], skip);

And testing:

$ gcc -Wall -O3 -std=gnu99 test3.c
$ time ./a.out 1000 10
50004989999950500

real    0m53.514s
user    0m53.473s
sys     0m0.025s

The expensive part seems to be those atoi() calls, if that's the only difference between -O3 compilation.

Perhaps you could write one binary, which loops through tests of various values of limit and skip, something like:

#define NUM_LIMITS 3
#define NUM_SKIPS 2
...
int limits[NUM_LIMITS] = {100, 1000, 1000};
int skips[NUM_SKIPS] = {1, 10};
int limit, skip;
...
for (int limitIdx = 0; limitIdx < NUM_LIMITS; limitIdx++)
    for (int skipIdx = 0; skipIdx < NUM_SKIPS; skipIdx++)
        /* per-limit, per-skip test */

If you know your parameters ahead of compilation time, perhaps you can do it this way. You could use fprintf() to write your output to a per-limit, per-skip file output, if you want results in separate files.

Alex Reynolds
  • 95,983
  • 54
  • 240
  • 345
1

You could try using the GCC likely/unlikely builtins (e.g. here) or profile guided optimization (e.g. here). Also, do you intend (i + j) % 10 or i + (j % 10)? The % operator has higher precedence, so your code as written is testing the latter.

Community
  • 1
  • 1
Matt K
  • 13,370
  • 2
  • 32
  • 51
1

I'm a bit familiar with the program Niels is asking about. There are a bunch of interesting answers around (thanks), but the answers slightly miss the spirit of the question. The given example programs are really just example programs. The logic that is subject to pre-processor statements is much much more involved. In the end, it is not just about executing a modulo operation or a simple division. it is about keeping or skipping certain procedure calls, executing an operation between two other operations etc, defining the size of an array, etc.

All these things could be guarded by variables that are set by command-line parameters. But that would be too costly as many of these routines, statements, memory allocations are executed a billion times. Perhaps that shapes the problem a bit better. Still very interested in your ideas.

Dirk

Dirk
  • 11
  • 1
1

If you would use C++ instead of C you could use templates so that things can be calculated at compile time, even recursions are possible. Please have a look at C++ template meta programming.

codymanix
  • 28,510
  • 21
  • 92
  • 151
0

A stupid answer, but you could pass the define on the gcc command line and run the whole thing with a shell script that recompiles and runs the program based on a command-line parameter

#!/bin/sh
skip=$1
out=program_skip$skip
if [ ! -x $out ]; then 
    gcc -O3 -Dskip=$skip -o $out test.c
fi
time $out 1000
evil otto
  • 10,348
  • 25
  • 38
0

I got also an about 2× slowdown between program_define and program_variable, 26.2s vs. 49.0s. I then tried

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    int i, j, r;
    long result = 0;

    int limit = atoi(argv[1]);
    int skip = atoi(argv[2]);

    for (i = 0; i < 10000000; ++i) {
        for (j = 0, r = 0; j < limit; ++j, ++r) {
            if (r == skip) r = 0;
            if (i + r == 0) {
                continue;
            }
            result  += i + j;
        }
    }

    printf("%lu\n", result);
    return 0;
}

using an extra variable to avoid the costly division, and the resulting time was 18.9s, so significantly better than the modulo with a statically known constant. However, this auxiliary-variable technique is only promising if the change is easily predictable.

Daniel Fischer
  • 181,706
  • 17
  • 308
  • 431
0

Another possibility would be to eliminate using the modulus operator:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    int i, j;
    long result = 0;

    int limit = atoi(argv[1]);
    int skip = atoi(argv[2]);
    int current = 0;

    for (i = 0; i < 10000000; ++i) {
        for (j = 0; j < limit; ++j) {
            if (++current == skip) {
                current = 0;
                continue;
            }
            result  += i + j;
        }
    }

    printf("%lu\n", result);
    return 0;
}
Jerry Coffin
  • 476,176
  • 80
  • 629
  • 1,111
0

If that is the actual code, you have a few ways to optimize it:

(i + j % 10==0) is only true when i==0, so you can skip that entire mod operation when i>0. Also, since i + j only increases by 1 on each loop, you can hoist the mod out and simply have a variable you increment and reset when it hits skip (as has been pointed out in other answers).

MSN
  • 53,214
  • 7
  • 75
  • 105
0

You can also have all possible function implementations already in the program, and at runtime you change the function pointer to select the function which you are actually are using.

You can use macros to avoid that you have to write duplicate code:

#define MYFUNCMACRO(name, myvar) void name##doit(){/* time consuming code using myvar */}

MYFUNCMACRO(TEN,10)
MYFUNCMACRO(TWENTY,20)
MYFUNCMACRO(FOURTY,40)
MYFUNCMACRO(FIFTY,50)

If you need to have too many of these macros (hundreds?) you can write a codegenerator which writes the cpp file automatically for a range of values.

I didn't compile nor test the code, but maybe you see the principle.

codymanix
  • 28,510
  • 21
  • 92
  • 151
-2

You might be compiling without optimisation, which will lead your program to load skip each time it's checked, instead of the literal of 10. Try adding -O2 to your compiler's command line, and/or use

register int skip;
ObjectiveC-oder
  • 342
  • 1
  • 5
  • 2
    `register` may not do anything. The compiler is free to ignore it: http://stackoverflow.com/questions/578202/register-keyword-in-c – Matt K Jan 18 '12 at 16:55