I'm looking for the best bit-operations library or function written in C, and I figured the Linux kernel would be the best source for this.
So I copied the kernel's `set_bit` function from `arch/x86/include/asm/bitops.h`, compared it with mine, and saw a strange result!

kernel_bitops.c

#define ADDR                   BITOP_ADDR(addr)
#define __ASM_FORM(x)          #x 
#define BITOP_ADDR(x)          "m" (*(volatile long *) (x))
#define __ASM_SEL(a,b)          __ASM_FORM(b)
#define __ASM_SIZE(inst, ...)   __ASM_SEL(inst##l##__VA_ARGS__, inst##q##__VA_ARGS__)

__always_inline void linux_set_bit(long nr, volatile unsigned long *addr)
{
  asm volatile(__ASM_SIZE(bts) " %1,%0" : : ADDR, "Ir" (nr) : "memory");
}
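
(For reference, on x86-64 these macros should expand the asm statement to the following; this expansion is derived by hand from the macros above:)

asm volatile("btsq %1,%0"
             :
             : "m" (*(volatile long *) (addr)), "Ir" (nr)
             : "memory");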

my_bitops.c

#define SETBIT(_value, _bitIndex)   _value |= (1ul<<(_bitIndex))
__always_inline void mine_set_bit(long nr, volatile unsigned long *addr)
{
    SETBIT(*addr, nr);
}

main.c

#define ARRAY_SIZE  10000000
static unsigned long num_array[ARRAY_SIZE];
unsigned long int num = 0x0F00000F00000000;
for (int i = 0; i < ARRAY_SIZE; i++)
    num_array[i] = num;

clock_t start = clock();
for (unsigned long int i = 0; i < ARRAY_SIZE; i++)
    for (unsigned long int j = 0; j < sizeof(unsigned long int) * 8; j++) {
        // uncomment one of the two calls per run:
        // linux_set_bit(j, &num_array[i]);
        // mine_set_bit(j, &num_array[i]);
    }
clock_t end = clock();

Time taken for Linux: 1375991 us
Time taken for mine: 912256 us
CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz

The assembly code generated with -O2 is:

            26 [1]                 linux_set_bit(j, &num_array[i]);
    0x4005c0  <+   90>        48 8b 45 d0                    mov    -0x30(%rbp),%rax
    0x4005c4  <+   94>        48 c1 e0 03                    shl    $0x3,%rax
    0x4005c8  <+   98>        48 8d 90 60 10 60 00           lea    0x601060(%rax),%rdx
    0x4005cf  <+  105>        48 8b 45 d8                    mov    -0x28(%rbp),%rax
    0x4005d3  <+  109>        48 89 d6                       mov    %rdx,%rsi
    0x4005d6  <+  112>        48 89 c7                       mov    %rax,%rdi
    0x4005d9  <+  115>        e8 69 00 00 00                 callq  0x400647 <linux_set_bit>

        71 [1]    asm volatile(__ASM_SIZE(bts) " %1,%0" : : ADDR, "Ir" (nr) : "memory");
0x400653  <+   12>        48 8b 45 f0  mov    -0x10(%rbp),%rax
0x400657  <+   16>        48 8b 55 f8  mov    -0x8(%rbp),%rdx
0x40065b  <+   20>        48 0f ab 10  bts    %rdx,(%rax)

        19 [1]      SETBIT(*addr,nr);
0x400653  <+   12>        48 8b 45 f0     mov    -0x10(%rbp),%rax
0x400657  <+   16>        48 8b 00        mov    (%rax),%rax
0x40065a  <+   19>        48 8b 55 f8     mov    -0x8(%rbp),%rdx
0x40065e  <+   23>        be 01 00 00 00  mov    $0x1,%esi
0x400663  <+   28>        89 d1           mov    %edx,%ecx
0x400665  <+   30>        d3 e6           shl    %cl,%esi
0x400667  <+   32>        89 f2           mov    %esi,%edx
0x400669  <+   34>        89 d2           mov    %edx,%edx
0x40066b  <+   36>        48 09 c2        or     %rax,%rdx
0x40066e  <+   39>        48 8b 45 f0     mov    -0x10(%rbp),%rax
0x400672  <+   43>        48 89 10        mov    %rdx,(%rax)

Where am I wrong? Or does Linux have a slow operation?

HP_perfect
  • Have you repeated your benchmark multiple times? Have you looked at the generated assembly? How do you compile your code? – Macmade Oct 04 '20 at 16:42
  • Yes, I repeated it many times. Unfortunately I don't know assembly, and I compile my code in release mode (-O3). – HP_perfect Oct 04 '20 at 16:54
  • Get the assembly out of the code, `gcc -S`, or use `objdump -d yourexenamegoeshere` – Antti Haapala -- Слава Україні Oct 04 '20 at 16:58
  • Ah, you're calling these as **functions**... that's contrary to **all** expectations; the `__always_inline` could not possibly work at all then. – Antti Haapala -- Слава Україні Oct 04 '20 at 17:02
  • Tired of guessing, please provide a [mre], i.e. a *single C file example* with includes and the outputs that proves your point. – Antti Haapala -- Слава Україні Oct 04 '20 at 17:04
  • Can you be sure your CPU didn't throttle its clock speed up or down between your tests? – Nate Eldredge Oct 04 '20 at 21:22
  • @AnttiHaapala I added the assembly code. – HP_perfect Oct 05 '20 at 16:38
  • `mov -0x10(%rbp),%rax` - That doesn't look like optimized asm. Without a VLA in main, GCC should have optimized away the frame pointer. Also, `T *addr` should be "private" and not need to be reloaded even across a `volatile*` access. And `linux_set_bit` should be inlined, but you have a `callq 0x400647`. Is that because you put it in a separate file and didn't compile with `-flto`, adding overhead and preventing constant-propagation? Putting everything in one C file like Antti said would fix *that* problem; compile it with `-O2` or `-O3`. – Peter Cordes Oct 06 '20 at 01:42

2 Answers


The main difference is that your code can't handle a "bit number" larger than the number of bits in an `unsigned long`, while Linux's version can. Because of that difference, you've written a benchmark loop that works around your version's limitation; that loop isn't ideal when the limitation doesn't exist, and it isn't ideal for Linux's version.

Specifically, for Linux's version you could/should do this:

for (unsigned long int i = 0 ; i < ARRAY_SIZE * sizeof(unsigned long int) * 8; i++) {
    linux_set_bit(i, num_array);
}

By removing the entire inner-loop overhead, plus the calculation needed to find a pointer to an element of the array (the `&num_array[i]` part), it'll be significantly faster (and probably faster than yours).
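
For illustration, here is a rough, non-atomic C equivalent of what the register-operand `bts` gives you (`wide_set_bit` is a made-up name, not kernel code): the bit number selects both the word and the bit within it, so `nr` may be larger than 63.

#define BITS_PER_LONG (sizeof(unsigned long) * 8)

/* Illustrative sketch: like bts %reg,(mem), the bit number indexes past
 * the first word, so nr >= 64 reaches into later array elements.
 * mine_set_bit can only touch bits 0..63 of the word it's given. */
static inline void wide_set_bit(long nr, volatile unsigned long *addr)
{
    addr[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

With those semantics, the single flat loop above walks every bit of the whole array without ever computing `&num_array[i]` by hand.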

Brendan
  • Good spot. This is an odd feature of the `bt*` instructions with a memory operand: they appear to address a particular word in memory, but they can actually access a completely different word. The tradeoff is that they are slow. There is some discussion at https://stackoverflow.com/a/50833363/634919. – Nate Eldredge Oct 04 '20 at 22:19
  • Excellent, but I still can't understand why Linux is slow at this 64-bit operation. And if somewhere the Linux source code uses this 64-bit operation many times, couldn't that mean slower performance? – HP_perfect Oct 05 '20 at 16:41
  • @HP_perfect: I'm not sure how it's used in Linux (e.g. maybe it's only used for large bitfields where the benefits matter more). Note that kernel code is also more like "many infrequently executed functions/services" where code size has more impact on instruction cache misses/performance than it would for normal code (and inside tight loops). – Brendan Oct 05 '20 at 18:47

Yes, `bts %reg, (mem)` is slow (https://uops.info); I don't know why Linux forces that form without using a `lock` prefix. Possibly the operation needs to be atomic with respect to interrupts on the same core, which doing it with a single instruction accomplishes.
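
To make the interrupt-atomicity point concrete, here's an illustrative plain-C sequence (a made-up function, not kernel code) showing where a split read-modify-write can lose an update:

/* Not atomic wrt. interrupts: the RMW is three separate steps. */
void racy_set_bit(long nr, volatile unsigned long *addr)
{
    unsigned long tmp = *addr;   /* load                                    */
                                 /* <- an interrupt handler that sets a bit
                                  *    in *addr can run here...             */
    tmp |= 1UL << nr;            /* modify                                  */
    *addr = tmp;                 /* ...and this store overwrites its update.
                                  * A single bts %reg,(mem) can't be split
                                  * by an interrupt on the same core.       */
}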

If not, it's faster to emulate it with multiple instructions to calculate the address of the byte or dword containing the bit you want: How can memory destination BTS be significantly slower than load / BTS reg,reg / store?
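
The idea in C (a non-atomic sketch with a made-up name): compute the address of the byte containing the bit, then do an ordinary RMW on that byte, which matches x86's little-endian bit numbering and lowers to cheap instructions:

/* Sketch: address the byte containing bit nr, then set the bit in it.
 * Compiles to simple address math + shift/or instead of the microcoded
 * bts %reg,(mem). */
static inline void set_bit_byte(long nr, unsigned char *base)
{
    base[nr >> 3] |= (unsigned char)(1u << (nr & 7));
}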

(`bts imm, (mem)` is not bad, though, so you could check `__builtin_constant_p(bitpos)` and use memory-destination `bts` when the bit position is a compile-time constant.)
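
A hedged sketch of that dispatch (`set_bit_sketch` is hypothetical, not the kernel's code, though the kernel uses a similar `__builtin_constant_p` trick; assumes x86-64 and optimization enabled so the untaken branch folds away before the "i" constraint is checked):

/* Use the cheap immediate form of bts only when the bit position is a
 * compile-time constant; otherwise plain C, which compiles to shift+or
 * or bts %reg,%reg. Not atomic. */
static inline __attribute__((always_inline))
void set_bit_sketch(long nr, volatile unsigned long *addr)
{
    if (__builtin_constant_p(nr)) {
        asm volatile("btsq %1, %0"
                     : "+m" (addr[nr / 64])
                     : "i" (nr % 64)
                     : "cc", "memory");
    } else {
        addr[nr / 64] |= 1UL << (nr % 64);
    }
}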

As @Brendan points out, your version only works for `bitpos < sizeof(unsigned long) * CHAR_BIT`, i.e. within the first qword.


I don't know exactly why Linux forces a memory-destination `bts` with a `volatile` pointer. Presumably there's some kind of reason other than performance. Otherwise, yes, it's a missed optimization.

Peter Cordes