Intrinsic __lzcnt64 returns different values with different compile options

Question

I have the following code:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

long long lzcnt(long long l)
{
    return __lzcnt64(l);
}

int main(int argc, char** argv)
{
    printf("%lld\n", lzcnt(atoll(argv[1])));
    return 0;
}

Running with different compilers and options I get (assembly shown):

Clang

$ clang -Wall src/test.c -D__LZCNT__ && ./a.out 2047
53

0000000000400560 <lzcnt>:
400560:   55                      push   %rbp
400561:   48 89 e5                mov    %rsp,%rbp
400564:   48 89 7d f0             mov    %rdi,-0x10(%rbp)
400568:   48 8b 7d f0             mov    -0x10(%rbp),%rdi
40056c:   48 89 7d f8             mov    %rdi,-0x8(%rbp)
400570:   48 8b 7d f8             mov    -0x8(%rbp),%rdi
400574:   48 0f bd ff             bsr    %rdi,%rdi
400578:   48 83 f7 3f             xor    $0x3f,%rdi
40057c:   89 f8                   mov    %edi,%eax
40057e:   48 63 c0                movslq %eax,%rax
400581:   5d                      pop    %rbp
400582:   c3                      retq   
400583:   66 66 66 66 2e 0f 1f    data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
40058a:   84 00 00 00 00 00

GCC without -mlzcnt

$ gcc -Wall src/test.c -D__LZCNT__ && ./a.out 2047
53

0000000000400580 <lzcnt>:
400580: 55                    push   %rbp
400581: 48 89 e5              mov    %rsp,%rbp
400584: 48 89 7d e8           mov    %rdi,-0x18(%rbp)
400588: 48 8b 45 e8           mov    -0x18(%rbp),%rax
40058c: 48 89 45 f8           mov    %rax,-0x8(%rbp)
400590: 48 0f bd 45 f8        bsr    -0x8(%rbp),%rax
400595: 48 83 f0 3f           xor    $0x3f,%rax
400599: 48 98                 cltq   
40059b: 5d                    pop    %rbp
40059c: c3                    retq

GCC with -mlzcnt

$ gcc -Wall src/test.c -D__LZCNT__ -mlzcnt && ./a.out 2047
10

0000000000400580 <lzcnt>:
400580: 55                    push   %rbp
400581: 48 89 e5              mov    %rsp,%rbp
400584: 48 89 7d e8           mov    %rdi,-0x18(%rbp)
400588: 48 8b 45 e8           mov    -0x18(%rbp),%rax
40058c: 48 89 45 f8           mov    %rax,-0x8(%rbp)
400590: f3 48 0f bd 45 f8     lzcnt  -0x8(%rbp),%rax
400596: 48 98                 cltq   
400598: 5d                    pop    %rbp
400599: c3                    retq

G++ without -mlzcnt

$ g++ -Wall src/test.c -D__LZCNT__ && ./a.out 2047
In file included from /usr/lib/gcc/x86_64-redhat-linux/4.8.2/include/immintrin.h:64:0,
                 from /usr/lib/gcc/x86_64-redhat-linux/4.8.2/include/x86intrin.h:62,
                 from src/test.c:3:
/usr/lib/gcc/x86_64-redhat-linux/4.8.2/include/lzcntintrin.h: In function ‘short unsigned int __lzcnt16(short unsigned int)’:
/usr/lib/gcc/x86_64-redhat-linux/4.8.2/include/lzcntintrin.h:38:29: error: ‘__builtin_clzs’ was not declared in this scope
return __builtin_clzs (__X);

G++ with -mlzcnt

$ g++ -Wall src/test.c -D__LZCNT__ -mlzcnt  && ./a.out 2047
10

0000000000400640 <_Z5lzcntx>:
400640: 55                    push   %rbp
400641: 48 89 e5              mov    %rsp,%rbp
400644: 48 89 7d e8           mov    %rdi,-0x18(%rbp)
400648: 48 8b 45 e8           mov    -0x18(%rbp),%rax
40064c: 48 89 45 f8           mov    %rax,-0x8(%rbp)
400650: f3 48 0f bd 45 f8     lzcnt  -0x8(%rbp),%rax
400656: 48 98                 cltq   
400658: 5d                    pop    %rbp
400659: c3                    retq

The difference is quite clearly the use of -mlzcnt, however I'm actually working in C++ and without that option it doesn't compile on g++ (clang++ is fine). It looks like when -mlzcnt is used then the result is 63-(result without -mlzct). Is there any documentation on the -mlzcnt option for gcc (I looked through the info files, but couldn't find anything)? Does it do anything more that opt for the lzcnt instruction?

Did you disassemble the programs to see if they use the expected instruction? Are you sure both platforms *have* the instruction? — unwind, Oct 30 '13 at 09:45
This question is not answerable without providing the assembly. — Daniel Kamil Kozar, Oct 30 '13 at 09:47
Looks like the Mac+Clang is using the `BSR` instruction, where as the Linux+GCC is using `LZCNT`. — Michael Barker, Oct 30 '13 at 09:59
You appear to be calling __lzcnt64 but passing a 32 bit integer. Perhaps that's confusing the compiler... — Ben, Oct 30 '13 at 11:31
Updated question to take in comments and more investigation (removed the OS variance, all code is compiled on the same Linux box). — Michael Barker, Oct 30 '13 at 19:38
Looks from the headers as if where you don't specify -mlzcnt it should generate an error `LZCNT instruction is not enabled`. (http://clang.llvm.org/doxygen/lzcntintrin_8h_source.html), because you are using an intrinsic which is not available. Possibly there is a compiler or header supplied routine somewhere (or it wouldn't work at all). Probably that is buggy. I suggest you look for that. — Ben, Oct 31 '13 at 09:26

Bill Lynch · Accepted Answer · 2013-10-30T21:55:08.447

5

First off, I'm able to perfectly replicate your problem with both clang 3.3 and gcc 4.8.1.

Here's my thoughts... I'm only about 50% on this.

LZCNT is an instruction that may not be supported by your computer.
Wikipedia suggests that Haswell support is needed for LZCNT
We can try to verify this information by using the Linux application cpuid. (Which is included in Debian, RHEL, etc).
Wikipedia again suggests that "Support is indicated via the CPUID.80000001H:ECX.ABM[Bit 5] flag".

Let's look at my system (which is a Xeon X3430, Lynnfield, Nehalem).

[4:48pm][wlynch@apple /tmp] sudo cpuid -1ir | grep 80000001
   0x80000001 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000001 edx=0x28100800

So, bit 23 of ECX is not true. So my system doesn't support LZCNT.

It also looks like it just happens that my machine interprets the unsupported LZCNT as a BSR.

edited Oct 30 '13 at 21:55

answered Oct 30 '13 at 21:49

Bill Lynch

80,138
16
128
173

1

The is basically the correct answer. The information I was missing was that -mlzcnt will force the generation of the lzcnt instruction even if the current platform doesn't support it. It is odd that the Intel CPUs will simply accept an lzcnt instruction and interpret it as a bsr rather than [reject with an illegal instruction](https://code.google.com/p/corkami/wiki/x86oddities#lzcnt). – Michael Barker Jan 19 '14 at 01:05
I think there is also a bug with the g++ compiler, which has a missing builtin (__builtin_clzs). If I comment references to that from the system headers it compiles and works. – Michael Barker Jan 19 '14 at 01:07
@Michael: Yup, this is correct. `lzcnt` is encoded as `rep bsr`, and x86 CPUs silently ignore insn prefix bytes that don't apply. This isn't future-proof, though, like happened here. An exception is the `rep ret` workaround for branch predictor limitations in old AMD CPUs: It's in widespread use, so future extensions can't redefine it: http://stackoverflow.com/a/32347393/224132. Usually it's easy to debug when you accidentally used an instruction not supported by your target machine, but not in this case. – Peter Cordes Sep 15 '15 at 20:28
@Michael: `-mlzcnt` tells the compiler the target does support the insn, so of course it behaves that way! Use `-march=native` if you want the compiler to produce code optimized for the same machine it's being compiled on (this isn't the default assumption). This includes using instructions that aren't supported by older CPUs. e.g. `-march=native` will use AVX when auto-vectorizing, on hosts that support it. Use `-march=core2 -mtune=haswell` if you want to *tune* for HSW, but make code that works on any CPU that supports everything Core2 does. – Peter Cordes Sep 15 '15 at 20:31

Ben · Answer 2 · 2013-10-30T11:58:21.247

1

You appear to be calling __lzcnt64 but passing a 32 bit integer. Perhaps that's confusing the compiler.

Possibly the one returning 10 is seeing some junk in the other half of the register?

Try this instead:

    long long int v = __lzcnt64(2047LL);

(Made it a long long literal).

edited Oct 30 '13 at 11:58

answered Oct 30 '13 at 11:31

Ben

34,935
6
74
113

Using this [code](https://gist.github.com/sharth/b689c85eb7e2288b2076), GCC 4.8.1, CentOS 6.4 x64. Under `-mlzcnt -O0`: I get 10. Under `-mlzcnt -O3`: I get 53. – Bill Lynch Oct 30 '13 at 21:10
What if you try `long long int v = 2047LL; v = __lzcnt64(v);`? Either way looks like a bug. – Ben Oct 31 '13 at 09:07

Intrinsic __lzcnt64 returns different values with different compile options

2 Answers2

Linked