
Is it possible to completely remove a function call from C code at runtime and insert it back when needed?

I'm not sure whether an ELF binary can be modified at run time, so that no CPU cycles are wasted when the function is not needed.

I don't want to place an 'if' check before the function call to avoid calling the function.

For example, if the global flag g_flg == 1, then func1 should look like this:

void func1(int x)
{
 /* some processing */

 func2(y);

 /* some processing */

}

If g_flg == 0, then func1 should look like this:

void func1(int x)
{
 /* some processing */

  /* some processing */

}
Vaibhav
    Why are you trying to do this? Checking `g_flg` at runtime will be a minute cost. – Benj Oct 02 '12 at 13:41
  • You could also call `func2()` indirectly, through a pointer. You can set the pointer to point to an empty/dummy function when you don't want `func2()` called. – Alexey Frunze Oct 02 '12 at 13:44
  • You could make it marginally faster than either of the two suggestions above by using a function pointer to call either func1_with_func2 or func1_without_func2. But the whole idea is wrongheaded. 'I don't want to place a 'if' check before the function call to avoid calling a function.' -- Why not? Everyone else does. – Jim Balter Oct 02 '12 at 13:45
  • It's possible that the code is mapped into a readonly page of memory, making it impossible to modify even if you knew how... – trojanfoe Oct 02 '12 at 13:45
  • Actually func2 is used for logging debug information, which I don't want when the system is running fine, hence I'm looking for an alternative to putting in an if check. func2 is debug logging, and in a single call flow its call count reaches at least 100, which means 100 if checks would be introduced at as many places. – Vaibhav Oct 02 '12 at 13:55
  • @Vaibhav But _why_ don't you want an `if` there? It's the easiest and cleanest solution, and the run-time cost is a few nanoseconds. – Daniel Fischer Oct 02 '12 at 13:56
  • If it's a debugging routine, it would be easier to conditionally compile it in with a preprocessor macro (a rough sketch of that approach follows these comments). – Wug Oct 02 '12 at 14:02
  • For the sake of argument, since you asked about "excluding code at run time": if it's a shared library, we can only know at run time whether a function is called (or does nothing) by going through a function pointer. However, its execution time, if included, would actually be longer compared to static linking. – fkl Oct 02 '12 at 14:10
  • @Wug: I'm looking for a runtime solution. – Vaibhav Oct 02 '12 at 14:22
  • @Vaibhav: so assuming you don't re-write the code at runtime, you need to have a section of code in your binary that's executed if `g_flg` is true, but not if it isn't. What makes you think that your compiler will choose *not* to do this in the fastest way possible when you write `if (g_flg) { code; }`? Once you know what kind of compiler stupidity you're trying to work around, you might come up with a specific workaround that beats the compiler's all-purpose behavior. But for this case, unless the compiler is being pretty stupid there won't be one. – Steve Jessop Oct 02 '12 at 16:52
  • So that leaves code morphing at runtime. Which is difficult, but you could, for example, read the value of `g_flg`, output the resulting source for `func1` to a file, compile it to a dll, open the dll and thus get a function pointer to a suitable version of `func1` (a rough sketch of this also follows these comments). Depending how many times `func1` conditionally calls `func2`, this could possibly be a win. But rolling your own code-morphing system is way beyond what the gain is worth for most programs. – Steve Jessop Oct 02 '12 at 16:58
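
A rough sketch of the conditional-compilation approach from Wug's comment above. The DEBUG_LOG macro and the DEBUG_LOGGING build switch are made-up names for illustration; only func1 and func2 come from the question:

/* debug_log.h -- illustrative sketch, not the questioner's code */
#ifndef DEBUG_LOG_H
#define DEBUG_LOG_H

void func2(int y);              /* the debug-logging routine from the question */

#ifdef DEBUG_LOGGING
/* Compiled in: expands to a real call. */
#define DEBUG_LOG(y)  func2(y)
#else
/* Compiled out: no call, no branch, no runtime cost at all. */
#define DEBUG_LOG(y)  ((void)0)
#endif

#endif /* DEBUG_LOG_H */

/* usage inside func1 */
void func1(int x)
{
    int y = x;          /* some processing */

    DEBUG_LOG(y);       /* vanishes entirely unless built with -DDEBUG_LOGGING */

    /* some processing */
}

Note that this decides at build time (e.g. gcc -DDEBUG_LOGGING ...), not at run time, which is why the questioner rejected it in the comments above.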
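
And a rough sketch of the run-time code-morphing idea from Steve Jessop's comment, assuming a POSIX system with gcc available at run time; every name here (load_func1, func1_gen.c, and so on) is invented for illustration:

/* morph.c -- illustrative sketch only; build with: gcc morph.c -ldl -o morph */
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>

typedef int (*func1_fn)(int);

int g_flg = 0;   /* the global flag from the question */

/* Generate func1 with or without the func2 call, compile it into a
 * shared object, load it, and return a pointer to the chosen version. */
static func1_fn load_func1(int flag)
{
    void *handle;
    FILE *src = fopen("func1_gen.c", "w");

    if (!src)
        return NULL;

    fprintf(src, "#include <stdio.h>\n");
    fprintf(src, "static void func2(int y) { printf(\"debug: %%d\\n\", y); }\n");
    fprintf(src, "int func1(int x)\n{\n");
    fprintf(src, "    int y = x * 2;   /* some processing */\n");
    if (flag)
        fprintf(src, "    func2(y);        /* emitted only when the flag is set */\n");
    fprintf(src, "    return y;        /* some processing */\n}\n");
    fclose(src);

    if (system("gcc -O2 -shared -fPIC func1_gen.c -o func1_gen.so") != 0)
        return NULL;

    handle = dlopen("./func1_gen.so", RTLD_NOW);
    if (!handle)
        return NULL;

    /* Casting the void * from dlsym to a function pointer is the usual
     * POSIX practice, even though ISO C itself does not define it. */
    return (func1_fn)dlsym(handle, "func1");
}

int main(void)
{
    func1_fn func1 = load_func1(g_flg);

    if (!func1)
        return 1;
    printf("func1(21) = %d\n", func1(21));
    return 0;
}

As the comment says, regenerating and reloading the library every time g_flg changes is far more machinery than a single if check is worth for most programs.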

3 Answers


Don't optimize something that doesn't need it. Have you tried assessing the potential performance improvement?

Try setting g_flg to 1 and executing this:

if (g_flg == 1) {func2(y);}

Then try executing this:

func2(y);

Run both a million times (or however many times you can in a reasonable amount of time). I'm quite sure you'll notice there is virtually no difference between the two.
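
For example, a rough sketch of such a measurement; func2 here is a trivial stand-in for the real logging routine, and g_flg and the accumulator are volatile so the optimizer cannot simply discard the check or the work (compile with optimizations on; older glibc also needs -lrt for clock_gettime):

/* bench.c -- illustrative sketch, not the questioner's code */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

volatile int g_flg = 1;            /* the flag from the question */
volatile long sink = 0;            /* makes func2's work observable */

void func2(int y) { sink += y; }   /* stand-in for the debug-logging routine */

static double elapsed(const struct timespec *a, const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec)
         + (double)(b->tv_nsec - a->tv_nsec) / 1e9;
}

int main(void)
{
    enum { N = 1000000 };
    struct timespec t0, t1;
    long i;

    /* Guarded calls: if (g_flg == 1) { func2(y); } */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < N; i++)
        if (g_flg == 1) { func2((int)i); }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("with if:    %.6f s\n", elapsed(&t0, &t1));

    /* Unconditional calls: func2(y); */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < N; i++)
        func2((int)i);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("without if: %.6f s\n", elapsed(&t0, &t1));

    return 0;
}

With g_flg set to 1 both loops do exactly the same work, so any difference between the two numbers is the cost of the check itself.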

Apart from that, I think what you want to do is impossible, because ELF is a binary (compiled) format.

Evren Kuzucuoglu
  • For something like that it would be important to make sure that the compiler isn't performing any optimizations that it could make in the trivial benchmarking case that it can't in the real world one. – Wug Oct 02 '12 at 14:00
  • Ok makes sense. I'm sure there's a compiler directive to tell it not to optimize anything. – Evren Kuzucuoglu Oct 02 '12 at 14:01
  • I think it's -O0 (that's a capital O followed by a zero). This mode is apparently the default, and unless otherwise directed, gcc will not perform any optimizations and will compile the source code exactly as written. – Wug Oct 02 '12 at 14:03
  • Unfortunately it's pretty meaningless to measure performance differences with `-O0`, because then the compiler isn't making the optimizations that it could make in the real-world case, either. The only way to ensure that the compiler makes real-world optimizations but not trivial benchmark-only optimizations, is to enable optimization and give it real-world code and not trivial benchmark code. In this case, you need to make sure the value of `g_flg` isn't available to the optimizer and you need to make sure the code is actually executed at all. – Steve Jessop Oct 02 '12 at 16:36
  • Except in this case, the idea is to prove a line of code is costless; if you can prove that without optimizations the line of code is costless, so will it be with optimizations. – Evren Kuzucuoglu Oct 03 '12 at 08:46

What you could probably get away with doing instead would be something like this:

struct Something;
typedef struct Something Something;

int myFunction(Something * me, int i)
{
    // do a bunch of stuff
    return 42; // obviously the answer
}

int myFunctionDoNothing(Something * dummy1, int dummy2)
{
    return 0;
}

int (*function)(Something *, int) = myFunctionDoNothing;

// snip to actual use of function

int i;

function = myFunctionDoNothing;
for (i = 0; i < 100000; ++i) function(NULL, 5 * i); // does nothing

function = myFunction;
for (i = 0; i < 100000; ++i) function(NULL, 5 * i); // does something

WARNING

This might be a premature optimization. Depending on how your compiler treats this and how your CPU handles branching, you might actually lose performance this way compared to the naive way (checking a flag inside the function and returning early).

Wug
  • "might actually lose performance" - my instinct is that you're far more likely to lose performance this way than to gain it. In terms of functionality it's a perfectly good way to introduce a point of customization, but introducing a call through a function pointer as an alternative to testing the value of an int is... erm... probably not fast :-) – Steve Jessop Oct 02 '12 at 13:52
  • Might be worth bothering to test. If your CPU branches poorly, it might be an optimization as long as the arbitrary jump doesn't fudge up instruction fetching (my instincts say it shouldn't, because when you call a function you always jump to an address, the fact that you're jumping to an address that changes shouldn't make a huge difference after you've hit it once and it's in the cache). You'd lose out on compiler inlining though, which is a compiler-done optimization that would probably make a much bigger difference. – Wug Oct 02 '12 at 13:53
  • Sure, might as well test it if you have time on your hands. As long as the function pointer doesn't block `func2` from being inlined, the two will probably be fairly similar, so you never know which will edge it. – Steve Jessop Oct 02 '12 at 13:55
  • I think this approach is specific to a particular CPU and would need the same verification effort again if the CPU changes, hence I don't see the benefit in this approach. – Vaibhav Oct 02 '12 at 14:08
  • @Vaibhav: if you think that you can attempt the kind of nano-optimization that your question calls for and *not* have different results on different CPUs, then I think you're probably mistaken. – Steve Jessop Oct 02 '12 at 16:35
  • @Wug: I vaguely recall reading something about an architecture where a jump/call to variable target will stall the pipeline (agreed, not the cache) whereas a fixed target won't. But I can't remember where, so that's apocryphal, and I would guess that CPU would in any case branch "well", not "poorly" :-) It was inlining I was thinking about when I wrote the first comment, I didn't get that part of your comment until after I'd written my second... – Steve Jessop Oct 02 '12 at 16:46
  • @SteveJessop: As long as the switching isn't done frequently you should only take the pipeline stall once (per switch). It's a branch prediction thing. Related: http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array – Wug Oct 02 '12 at 17:36

On most desktop and server architectures branching is faster than indirect calls, since they do branch prediction and/or speculative execution. I have never heard of an architecture where indirect call is faster than a single branch. (Jump tables, for switch() statements, have more than one branch, and are therefore a different thing altogether.)

Consider the following microbenchmark I threw together. test.c:

/* test.c */

volatile long test_calls = 0L;
volatile long test_sum = 0L;

void test(long counter)
{
    test_calls++;
    test_sum += counter;
}

work.c:

/* work.c */

void test(long counter);

/* Work function, to be measured */
void test_work(long counter, int flag)
{
    if (flag)
        test(counter);
}

/* Dummy function, to measure call overhead */
void test_none(long counter __attribute__((unused)), int flag __attribute__((unused)) )
{
    return;
}

and harness.c:

#define  _POSIX_C_SOURCE 200809L
#include <unistd.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* From test.c */
extern volatile long test_calls;
extern volatile long test_sum;

/* Dummy function, to measure call overhead */
void test_none(long counter, int flag);

/* Work function, to be measured */
void test_work(long counter, int flag);

/* Timing harness -- GCC x86; modify for other architectures */
struct timing {
    struct timespec  wall_start;
    struct timespec  wall_stop;
    uint64_t         cpu_start;
    uint64_t         cpu_stop;
};

static inline void start_timing(struct timing *const mark)
{
    clock_gettime(CLOCK_REALTIME, &(mark->wall_start));
    mark->cpu_start = __builtin_ia32_rdtsc();
}

static inline void stop_timing(struct timing *const mark)
{
    mark->cpu_stop = __builtin_ia32_rdtsc();
    clock_gettime(CLOCK_REALTIME, &(mark->wall_stop));
}

static inline double cpu_timing(const struct timing *const mark)
{
    return (double)(mark->cpu_stop - mark->cpu_start); /* Cycles */
}

static inline double wall_timing(const struct timing *const mark)
{
    return (double)(mark->wall_stop.tv_sec - mark->wall_start.tv_sec)
         + (double)(mark->wall_stop.tv_nsec - mark->wall_start.tv_nsec) / 1000000000.0;
}

static int cmpdouble(const void *aptr, const void *bptr)
{
    const double a = *(const double *)aptr;
    const double b = *(const double *)bptr;

    if (a < b)
        return -1;
    else
    if (a > b)
        return +1;
    else
        return  0;
}

void report(double *const wall, double *const cpu, const size_t count)
{
    printf("\tInitial call: %.0f cpu cycles, %.9f seconds real time\n", cpu[0], wall[0]);

    qsort(wall, count, sizeof (double), cmpdouble);
    qsort(cpu, count, sizeof (double), cmpdouble);

    printf("\tMinimum:      %.0f cpu cycles, %.9f seconds real time\n", cpu[0], wall[0]);
    printf("\t5%% less than  %.0f cpu cycles, %.9f seconds real time\n", cpu[count/20], wall[count/20]);
    printf("\t25%% less than %.0f cpu cycles, %.9f seconds real time\n", cpu[count/4], wall[count/4]);
    printf("\tMedian:       %.0f cpu cycles, %.9f seconds real time\n", cpu[count/2], wall[count/2]);
    printf("\t75%% less than %.0f cpu cycles, %.9f seconds real time\n", cpu[count-count/4-1], wall[count-count/4-1]);
    printf("\t95%% less than %.0f cpu cycles, %.9f seconds real time\n", cpu[count-count/20-1], wall[count-count/20-1]);
    printf("\tMaximum:      %.0f cpu cycles, %.9f seconds real time\n", cpu[count-1], wall[count-1]);
}

int main(int argc, char *argv[])
{
    struct timing    measurement;
    double      *wall_seconds = NULL;
    double      *cpu_cycles = NULL;
    unsigned long    count = 0UL;
    unsigned long    i;
    int      flag;
    char         dummy;

    if (argc != 3 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s COUNT FLAG\n", argv[0]);
        fprintf(stderr, "\n");
        return 1;
    }

    if (sscanf(argv[1], " %lu %c", &count, &dummy) != 1) {
        fprintf(stderr, "%s: Invalid COUNT.\n", argv[1]);
        return 1;
    }
    if (count < 1UL) {
        fprintf(stderr, "%s: COUNT is too small.\n", argv[1]);
        return 1;
    }
    if (!(unsigned long)(count + 1UL)) {
        fprintf(stderr, "%s: COUNT is too large.\n", argv[1]);
        return 1;
    }

    if (sscanf(argv[2], " %d %c", &flag, &dummy) != 1) {
        fprintf(stderr, "%s: Invalid FLAG.\n", argv[2]);
        return 1;
    }

    wall_seconds = malloc(sizeof (double) * (size_t)count);
    cpu_cycles = malloc(sizeof (double) * (size_t)count);
    if (!wall_seconds || !cpu_cycles) {
        free(cpu_cycles);
        free(wall_seconds);
        fprintf(stderr, "Cannot allocate enough memory. Try smaller COUNT.\n");
        return 1;
    }

    printf("Call and measurement overhead:\n");
    fflush(stdout);
    for (i = 0UL; i < count; i++) {

        start_timing(&measurement);
        test_none(i, flag);
        stop_timing(&measurement);

        wall_seconds[i] = wall_timing(&measurement);
        cpu_cycles[i] = cpu_timing(&measurement);
    }
    report(wall_seconds, cpu_cycles, (size_t)count);

    printf("\n");
    printf("Measuring FLAG==0 calls: ");
    fflush(stdout);
    test_calls = 0L;
    test_sum = 0L;
    for (i = 0UL; i < count; i++) {

        start_timing(&measurement);
        test_work(i, 0);
        stop_timing(&measurement);

        wall_seconds[i] = wall_timing(&measurement);
        cpu_cycles[i] = cpu_timing(&measurement);
    }
    printf("%ld calls, sum %ld.\n", test_calls, test_sum);
    report(wall_seconds, cpu_cycles, (size_t)count);

    printf("\n");
    printf("Measuring FLAG==%d calls:", flag);
    fflush(stdout);
    test_calls = 0L;
    test_sum = 0L;
    for (i = 0UL; i < count; i++) {

        start_timing(&measurement);
        test_work(i, flag);
        stop_timing(&measurement);

        wall_seconds[i] = wall_timing(&measurement);
        cpu_cycles[i] = cpu_timing(&measurement);
    }
    printf("%ld calls, sum %ld.\n", test_calls, test_sum);
    report(wall_seconds, cpu_cycles, (size_t)count);


    printf("\n");
    printf("Measuring alternating FLAG calls: ");
    fflush(stdout);
    test_calls = 0L;
    test_sum = 0L;
    for (i = 0UL; i < count; i++) {

        start_timing(&measurement);
        test_work(i, i & 1);
        stop_timing(&measurement);

        wall_seconds[i] = wall_timing(&measurement);
        cpu_cycles[i] = cpu_timing(&measurement);
    }
    printf("%ld calls, sum %ld.\n", test_calls, test_sum);
    report(wall_seconds, cpu_cycles, (size_t)count);

    printf("\n");
    free(cpu_cycles);
    free(wall_seconds);
    return 0;
}

Put the three files in an empty directory, then compile and build ./call-test:

rm -f *.o
gcc -W -Wall -O3 -fomit-frame-pointer -c harness.c
gcc -W -Wall -O3 -fomit-frame-pointer -c work.c
gcc -W -Wall -O3 -fomit-frame-pointer -c test.c
gcc harness.o work.o test.o -lrt -o call-test

On AMD Athlon II X4 640, using gcc-4.6.3 (Xubuntu 10.04), running

./call-test 1000000 1

tells me that the overhead is just 2 clock cycles (< 1ns) for the test alone (branch not taken), and just 4 clock cycles (just over a nanosecond) when calling the second function which increases test_calls and adds the counter to test_sum.

When omitting all optimizations (use -O0 and leave out -fomit-frame-pointer when compiling), the test alone costs about 3 clock cycles when the branch is not taken, and about 9 cycles when the branch is taken and the work of updating the two extra variables is done.

(The two extra variables let you easily see that the harness does actually do all it should do; they're just an extra check. And I wanted to have some work in the second function, so the timing differences would be easier to spot.)

The above interpretation is only valid for the case when the code is already cached, i.e. run recently. If the code is run only rarely, it won't be in cache. However, then the test overhead matters even less. Caching effects -- for example, if "nearby" code has been run (you can see this for the call overhead measurement: the other test functions' code tends to get cached too!) -- are much larger anyway. (While the test harness does produce the initial call results separately, don't put too much faith in them, since it does not try to clear any caches in any way.)

My conclusion is that adding

if (flag)
    debug_function_call();

to any normal code is perfectly fine: the overhead is literally negligible, practically irrelevant. As always, consider the overall algorithm instead. Any enhancement in the algorithm yields much bigger rewards than worrying about the code the compiler generates.
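
If the objection from the question comments is having to repeat that if a hundred times, the check can live in a single macro, so it is written exactly once; DEBUG_CALL is an illustrative name, not something from this answer:

/* Illustrative only: the runtime check is written once, used everywhere. */
extern int g_flg;        /* the flag from the question */
void func2(int y);       /* the debug-logging routine from the question */

#define DEBUG_CALL(y)    \
    do {                 \
        if (g_flg)       \
            func2(y);    \
    } while (0)

Call sites then just say DEBUG_CALL(y); and pay the same negligible cost measured above.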

(Since I wrote the test code above in one sitting, there are likely some bugs and/or brainfarts in it. Check, and if you find any, let me know below so I can fix the code.)

Nominal Animal