
In the Linux kernel (version 2.6.2) there is a macro used to test a bit:

#define test_bit(nr, addr)                      \
        (__builtin_constant_p((nr))             \
         ? constant_test_bit((nr), (addr))      \
         : variable_test_bit((nr), (addr)))

where constant_test_bit and variable_test_bit are defined as:

static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
{
        return ((1UL << (nr & 31)) & (addr[nr >> 5])) != 0;
}


static __inline__ int variable_test_bit(int nr, const volatile unsigned long *addr)
{
        int oldbit;

        __asm__ __volatile__(
                "btl %2,%1\n\tsbbl %0,%0"
                :"=r" (oldbit)
                :"m" (ADDR),"Ir" (nr));  /* ADDR is a kernel macro: (*(volatile long *) addr) */
        return oldbit;
}

I understand that __builtin_constant_p is used to detect whether a variable is a compile-time constant or unknown. My question is: is there any performance difference between these two functions when the argument is a compile-time constant or not? Why use the C version when it is, and the assembly version when it isn't?

UPDATE: The following main function is used to test the performance:

constant, call constant_test_bit:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
        unsigned long i, j = 21;
        unsigned long cnt = 0;
        srand(111);
        //j = rand() % 31;
        for (i = 1; i < (1 << 30); i++) {
                j = (j + 1) % 28;
                if (constant_test_bit(j, &i))
                        cnt++;
        }
        if (__builtin_constant_p(j))
                printf("j is a compile time constant\n");
        return 0;
}

This correctly outputs the sentence "j is a compile time constant".

For the other cases, just uncomment the line that assigns a "random" number to j and change the function name accordingly. With that line uncommented the output is empty, as expected.

I use gcc test.c -O1 to compile, and here is the result:

constant, constant_test_bit:

$ time ./a.out 

j is a compile time constant

real    0m0.454s
user    0m0.450s
sys     0m0.000s

constant, variable_test_bit (omitting time ./a.out; same for the following):

j is a compile time constant

real    0m0.885s
user    0m0.883s
sys     0m0.000s

variable, constant_test_bit:

real    0m0.485s
user    0m0.477s
sys     0m0.007s

variable, variable_test_bit:

real    0m3.471s
user    0m3.467s
sys     0m0.000s

I ran each version several times, and the above results are typical values. It seems the constant_test_bit function is always faster than the variable_test_bit function, no matter whether the parameter is a compile-time constant or not... For the last two results (when j is not constant) the variable version is even dramatically slower than the constant one. Am I missing something here?

Shafik Yaghmour
Xiangyu Zhu
  • Could be, but the only way to find out is to measure. – deviantfan May 03 '15 at 11:22
  • Obviously someone thought it made a difference in perf, or there wouldn't be 2 versions. For the details, you have 4 cases to consider (passing a constant/non-constant to either function). What do you think happens in each case? Did you look at the generated assembly? – Marc Glisse May 03 '15 at 14:11
  • @deviantfan I have added the performance results. – Xiangyu Zhu May 03 '15 at 15:14
  • @MarcGlisse That's what I'm thinking of. May spend some time investigating the assembly code at a later time. – Xiangyu Zhu May 03 '15 at 15:16
  • Note that you're not exercising the "one bit from a BIG array" ability of these functions, which is where the assembler version might win (or not). – Ben Voigt May 04 '15 at 03:21

1 Answer


Using godbolt we can run an experiment with constant_test_bit. The following two test functions are compiled with gcc using the -O3 flag:

// Non constant expression test case
int func1(unsigned long i, unsigned long j)
{
  int x = constant_test_bit(j, &i) ;
  return x ;
}

// constant expression test case
int func2(unsigned long i)
{
  int x = constant_test_bit(21, &i) ;
  return x ;
}

We see the optimizer is able to optimize the constant expression case to the following:

shrq    $21, %rax
andl    $1, %eax

while the non-constant expression case ends up as follows:

sarl    $5, %eax
andl    $31, %ecx
cltq
leaq    -8(%rsp,%rax,8), %rax
movq    (%rax), %rax
shrq    %cl, %rax
andl    $1, %eax

So the optimizer is able to produce much better code for the constant-expression case. We can also see that the non-constant case for constant_test_bit is pretty bad compared to the hand-rolled assembly in variable_test_bit, and the implementer must believe the constant-expression case for constant_test_bit ends up being better than:

btl %edi,8(%rsp)
sbbl %esi,%esi 

for most cases.

As to why your test case seems to show a different conclusion: your test case is flawed. I have not been able to suss out all the issues, but if we look at this case using constant_test_bit with a non-constant expression, we can see the optimizer is able to move all the work outside the loop and reduce the work related to constant_test_bit inside the loop to:

movq    (%rax), %rdi

even with an older gcc version, but this case may not be relevant to the ways test_bit is used in the kernel. There may be more specific cases where this kind of optimization is not possible.

Shafik Yaghmour
  • But why use the variable version when the parameter is non constant? The assembly of constant version generated by gcc seems still better than the variable version in that case. – Xiangyu Zhu May 04 '15 at 02:30
  • @XiangyuZhu updated my answer, your test has some issues, it is hard to tease them all out but it is likely that the initial reason for this choice was driven by the reasons I laid out in my initial answer. I would hope they benchmarked using relevant cases and made sure the assumption really fit but hard to know. – Shafik Yaghmour May 04 '15 at 03:16
  • Yes, maybe the way my code used to test is so "special" that gcc is able to optimize it more than `variable_test_bit`. I looked up recent kernel code and found out similar code is still there, so there must be a good reason for this (using different versions of function). However I decide to just leave it alone for now. Thanks for your answer. – Xiangyu Zhu May 04 '15 at 03:41
  • 2
    @XiangyuZhu optimization has greatly improved since Linux started it is possible that the hand optimization no longer matters like it once did but figuring that out would require understanding the specific use cases. The danger with this is that people may just copy such optimizations blindly without benchmarking their case. – Shafik Yaghmour May 04 '15 at 13:09