14

Here are two ways to set an individual bit in C on x86-64:

inline void SetBitC(long *array, int bit) {
   // Pure C version
   *array |= 1<<bit;
}

inline void SetBitASM(long *array, int bit) {
   // Using inline x86 assembly
   asm("bts %1,%0" : "+r" (*array) : "g" (bit));
}

Using GCC 4.3 with the -O3 -march=core2 options, the C version takes about 90% more time when used with a constant bit. (Both versions compile to exactly the same assembly code, except that the C version uses an `or $(1<<num),%rax` instruction instead of a `bts $num,%rax` instruction.)

When used with a variable bit, the C version performs better but is still significantly slower than the inline assembly.

Resetting, toggling and checking bits have similar results.
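
For reference, the variants I mean look like this (a sketch in the same style, not the exact benchmarked code; btr resets, btc toggles, and bt plus setc checks a bit; the "Jr" constraint and the cast to long keep the bit-index operand as either a 0-63 immediate or a 64-bit register, which is what these instructions accept):

inline void ClearBitASM(long *array, int bit) {
   // C equivalent: *array &= ~(1L << bit);
   asm("btr %1,%0" : "+r" (*array) : "Jr" ((long)bit));
}

inline void ToggleBitASM(long *array, int bit) {
   // C equivalent: *array ^= 1L << bit;
   asm("btc %1,%0" : "+r" (*array) : "Jr" ((long)bit));
}

inline int TestBitASM(long *array, int bit) {
   // C equivalent: return (*array >> bit) & 1;
   unsigned char result;
   asm("bt %2,%1\n\tsetc %0" : "=q" (result) : "r" (*array), "Jr" ((long)bit));
   return result;
}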

Why does GCC optimize so poorly for such a common operation? Am I doing something wrong with the C version?

Edit: Sorry for the long wait; here is the code I used to benchmark. It actually started as a simple programming problem...

#include <stdio.h>

int main() {
    // Get the sum of all integers from 1 to 2^28 - 1 with bit 10 always set
    unsigned long i,j,c=0;
    for (i=1; i<(1<<28); i++) {
        j = i;
        SetBit(&j, 10);
        c += j;
    }
    printf("Result: %lu\n", c);
    return 0;
}

gcc -O3 -march=core2 -pg test.c
./a.out
gprof
with ASM: 101.12      0.08     0.08                             main
with C:   101.12      0.16     0.16                             main

time ./a.out also gives similar results.

Dumb Guy
  • Is it really a "common operation"? The most common time I see bit manipulation used is in premature optimization by people thinking they'll save memory by packing a bunch of flags into a single byte. – Anon. Jan 11 '10 at 02:26
  • Hmm...good point. Still, it's a pretty simple operation, and common in hardware drivers. Anyways, the point is the 90% performance decrease. – Dumb Guy Jan 11 '10 at 02:30
  • In driver programming, I'm pretty sure the "left-shift 1 to determine our mask" idiom isn't used much. Rather, you `or` with a predefined flag. – Anon. Jan 11 '10 at 02:41
  • You're right. The compiler does optimize out the shift when bit is constant. It's really strange how it's still 90% slower. – Dumb Guy Jan 11 '10 at 02:49
  • I'm surprised by the "90% slower" thing. The compiler likely expects `or` to take fewer clocks than `bts` - which is typically true. – Anon. Jan 11 '10 at 02:56
  • You are timing millions of these in a tight loop, correct? Could you post the exact code you are using to determine the timing? It might be that the tight loop increment+test uses the same CPU execution unit as the `OR`, leading to contention, while a `BTS` uses a different unit. – j_random_hacker Jan 11 '10 at 04:24
  • If your code is OK, you might want to submit this (if already not reported) to http://gcc.gnu.org/bugzilla/ – Laurynas Biveinis Jan 11 '10 at 07:22

5 Answers

13

Why does GCC optimize so poorly for such a common operation?

Prelude: Since the late 1980s, the focus of compiler optimization has moved away from microbenchmarks of individual operations and toward macrobenchmarks of applications whose speed people care about. These days most compiler writers are focused on macrobenchmarks, and developing good benchmark suites is taken seriously.

Answer: Nobody on the gcc team is using a benchmark where the difference between `or` and `bts` matters to the execution time of a real program. If you can produce such a program, you might be able to get the attention of people in gcc-land.

Am I doing something wrong with the C version?

No, this is perfectly good standard C. Very readable and idiomatic, in fact.

Norman Ramsey
2

Can you post the code that you are using to do the timing? This sort of operation can be tricky to time accurately.

In theory the two code sequences should be equally fast, so the most likely explanation (to my mind) is that something is causing your timing code to give bogus results.
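
To illustrate one such pitfall (a sketch, not code from the question; it assumes a POSIX system with clock_gettime, and older glibc may need -lrt): a volatile accumulator keeps the compiler from folding the whole loop away, and a monotonic clock keeps wall-clock adjustments out of the measurement.

#include <stdio.h>
#include <time.h>

int main(void) {
    volatile long long x = 0;            // volatile: the loop body can't be optimized out
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0); // monotonic: immune to clock adjustments
    for (long long n = 0; n < (1LL << 28); n++)
        x |= 1LL << (n & 63);            // the operation under test
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (double)(t1.tv_sec - t0.tv_sec)
             + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("elapsed: %.3fs (x = %lld)\n", s, x);
    return 0;
}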

Stephen Canon
  • Yes. On all x86 CPUs I'm aware of, a basic ALU operation like `OR` is at least as fast as "exotic" ops like `BTS`. One possibility is that the increment+test in your timing loop uses the same CPU execution unit as the `OR`, leading to contention, while a `BTS` uses a different unit. – j_random_hacker Jan 11 '10 at 04:27
  • On recent x86 hardware, `or` can be dispatched onto more than one execution unit, so there shouldn't be any contention there. – Stephen Canon Jan 11 '10 at 04:51
2

For such code:

#include <stdio.h>
#include <time.h>

int main() {
  volatile long long i = 0;
  time_t start = time (NULL);
  for (long long n = 0; n < (1LL << 32); n++) {
    i |= 1 << 10;
  }
  time_t end = time (NULL);
  printf("C took %ds\n", (int)(end - start));
  start = time (NULL);
  for (long long n = 0; n < (1LL << 32); n++) {
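    /* Note: the "[i]"(i) input below is a matching constraint; it
       forces the input into the same register as the "=r" output. */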
    __asm__ ("bts %[bit], %[i]"
                  : [i] "=r"(i)
                  : "[i]"(i), [bit] "i" (10));
  }
  end = time (NULL);
  printf("ASM took %ds\n", (int)(end - start));
}

the result was:

C took 12s
ASM took 10s

My flags were `-std=gnu99 -O2 -march=core2`. Without the volatile, the loop was optimized out. gcc 4.4.2.

There was no difference with:

__asm__ ("bts %[bit], %[i]"
              : [i] "+m"(i)
              : [bit] "r" (10));

So probably the answer is: no one cares. In a microbenchmark the difference between these two methods shows up, but in real life I believe such code does not take much CPU time.

Additionally, for such code:

#include <stdio.h>
#include <time.h>

int main() {
  volatile long long i = 0;
  time_t start = time (NULL);
  for (long long n = 0; n < (1LL << 32); n++) {
    i |= 1 << (n % 32);
  }
  time_t end = time (NULL);
  printf("C took %ds\n", (int)(end - start));
  start = time (NULL);
  for (long long n = 0; n < (1LL << 32); n++) {
    __asm__ ("bts %[bit], %[i]"
                  : [i] "+m"(i)
                  : [bit] "r" (n % 32));
  }
  end = time (NULL);
  printf("ASM took %ds\n", (int)(end - start));
}

The result was:

C took 9s
ASM took 10s

Both results were stable. Test CPU: Intel(R) Core(TM)2 Duo CPU T9600 @ 2.80GHz.

Maciej Piechotka
  • `long long` suffix is `LL`, not `L` – phuclv Mar 29 '17 at 06:21
  • @LưuVĩnhPhúc good point. Fixed, though it should not affect the result, as Linux is an LP64 platform (it might affect Windows or other LLP64 platforms). – Maciej Piechotka Mar 29 '17 at 20:26
  • Looking at your second test, this doesn't seem like a 'fair' comparison. You are forcing `i` to be in memory instead of a register, which the C code probably does not do. How about `__asm__ ("bts %[bit], %[i]" : [i] "+r"(i) : [bit] "r" (n % 32));`? – David Wohlferd Mar 29 '17 at 23:01
  • @DavidWohlferd I do force it in C as well by the use of volatile keyword. Your code will just generate additional `mov` instructions (which likely won't impact anything due to uops optimization). – Maciej Piechotka Mar 29 '17 at 23:04
  • @MaciejPiechotka - And yet when examining the output (gcc 6.1 -O3 -m64) that's exactly what the C code does (`orq %r8, %rax ; movq %rax, 40(%rsp)`). While it's hard to say with such tiny performance changes, it appears that "r" is indeed a smidgen faster (no doubt due to the uops optimization you mentioned). But (more importantly) if the goal is to compare the perf of `or` vs `bts`, making sure both bits of code use the same form seems important. – David Wohlferd Mar 31 '17 at 02:29
  • @DavidWohlferd Hmm. Probably because operations on registers are what compilers use, so more effort probably goes into them than into operations that act directly on memory (which are more likely to appear in hand-written 16-bit assembly). I'm still a bit surprised, as I would expect L1 to be the long pole[1]. However, please do specify the platform, because it may differ from generation to generation (I've heard that `inc` was much slower than `add 1` on P4, even though it was the other way round on P3). – Maciej Piechotka Mar 31 '17 at 02:49
  • [1] That said, there might be some tricks to avoid snooping, with the memory kept entirely in the register file. One way or another, I don't think this is an operation that would make much difference in most real-life applications...
  • "I don't think this is an operation which in most real life application would have much difference" - I agree. I'm not even sure what brought me to this particular question, but the inline asm stuff tends to catch my eye and the "m" just didn't look right. FWIW, the computer where using "r" was slightly faster was an I7 (820QM). I'm not sure what more there is to learn here, and the OP is long gone. – David Wohlferd Mar 31 '17 at 05:00
  • @DavidWohlferd I guess - benchmarking is fun but hard. – Maciej Piechotka Mar 31 '17 at 19:37
1

This is a very, very common operation on embedded systems, which are generally resource constrained. 10 cycles vs. 5 cycles is a nasty performance penalty on such systems. There are many cases where one wants to access IO ports or use 16- or 32-bit registers as Boolean bit flags to save memory.

The fact is that `if (bit_flags & 1<<12)` is far more readable [and portable when implemented with a library] than the assembly equivalent. Likewise for `IO_PINS |= 1<<5;`. These are unfortunately many times slower, so the awkward asm macros live on.
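
For illustration, such a library layer might look like this (a hypothetical sketch, not from any real vendor header); a given port can then swap the portable C bodies for asm where the cycles matter:

#define BIT_SET(reg, n)    ((reg) |=  (1UL << (n)))   /* set bit n */
#define BIT_CLEAR(reg, n)  ((reg) &= ~(1UL << (n)))   /* reset bit n */
#define BIT_TOGGLE(reg, n) ((reg) ^=  (1UL << (n)))   /* flip bit n */
#define BIT_TEST(reg, n)   (((reg) >> (n)) & 1UL)     /* check bit n */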

In many ways, the goals of embedded and userspace applications are opposite. The responsiveness of external communications (to a user interface or machine interface) is of minor importance, while ensuring that a control loop (equivalent to a micro-benchmark) completes in minimal time is absolutely critical and can make or break a chosen processor or control strategy.

Obviously, if one can afford a multi-GHz CPU and all the associated peripherals, chipsets, etc. needed to support it, one does not really need to worry about low-level optimisation at all. A 1000x slower microcontroller in a real-time control system means that saving clock cycles is 1000x more important.

0

I think you're asking a lot of your optimizer.

You might be able to help it out a little by doing a `register long z = 1L << bit;`, then or-ing that with your array.
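
That is, something like the following (a sketch of the suggested rewrite, reusing the question's function name):

inline void SetBitC(long *array, int bit) {
   register long z = 1L << bit;  /* hoist the mask into a register */
   *array |= z;
}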

However, I assume that by "90% more time" you mean that the C version takes 10 cycles and the asm version takes 5 cycles, right? How does the performance compare at -O2 or -O1?

Seth