3

I can find the n'th set bit of a 16-bit value by using a lookup table but this can't be done for a 32-bit value, without breaking it up and using several LUTs. This is discussed in How to efficiently find the n-th set bit? which suggests a use similar to this

#include <immintrin.h>
//...
unsigned i = 0x6B5;                 // 11010110101
unsigned n = 4;                     //    ^
unsigned j = _pdep_u32(1u << n, i);
int bitnum = __builtin_ctz(j));     // result is bit 7

I have profiled this suggested loop method

j = i;
for (k = 0; k < n; k++)
    j &= j - 1;
return (__builtin_ctz(j));

against the two variants of the bit-twiddling code in an answer from Nominal Animal. The branching variant of the twiddle was the fastest of the three code versions, not by much. However in my actual code, the above snippet using __builtin_ctz was faster. There are other answers such as from fuz who suggests what I found too: the bit-twiddles take a similar time as the loop method.

So I now want to try using _pdep_u32 but am unable to get _pdep_u32 recognised. I read in gcc.gnu.org

The following built-in functions are available when -mbmi2 is used.
...
unsigned int _pdep_u32 (unsigned int, unsigned int)
unsigned long long _pdep_u64 (unsigned long long, unsigned long long)
...

However I am using an online compiler and don't have access to the options.

Is there a way to enable options with the preprocessor?

There is plenty of material about controlling the preprocessor from command line options, but I can't find how to do it the other way round.

Ultimately I want to use the 64-bit version _pdep_u64.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Weather Vane
  • 33,872
  • 7
  • 36
  • 56
  • GCC will never emit asm instructions that aren't enabled via options, target pragmas, or function attributes. (Unless you use inline asm). The preprocessor won't help, except maybe to trick `#immintrin.h` into defining stuff that you can use in a function with a pragma. – Peter Cordes Jan 04 '20 at 13:11
  • @PeterCordes thank you, so could it be coded directly using inline asm in the C code, without enabling options? Aside: MSVC compiled the instrinsic without fuss, but the program crashed (I suppose `_pdep_u32` is not supported by my cpu as suggested in the linked question). – Weather Vane Jan 04 '20 at 13:17
  • Yeah, MSVC and GCC have different philosophies about intrinsics. MSVC lets you use whatever, under the assumption that you'll do runtime dispatching. You're using an online compiler but running locally, and you can't pass GCC options? That sounds horrible. – Peter Cordes Jan 04 '20 at 13:18

1 Answers1

4

Other than inline asm, gcc/clang will never emit an asm instruction that isn't enabled via a command line option, function attribute, or pragma.

If you can't use GCC target options on the command line like a normal person (e.g. -march=native to enable everything supported by the compile host, and tune for that machine), you can instead use a pragma to set target-specific options. It doesn't work very well, though; it seems #pragma target("arch=skylake") breaks GCC's immintrin.h (if you put the pragma before the include), apparently trying to compile AVX512 functions without AVX512 enabled. The GCC docs do show examples of using "arch=core2" as a function attribute, or even "sse4.1,arch=core2"

You can set target("bmi2,tune=skylake") though without breaking anything (before or after #include <immintrin.h>. (unlike arch=, tune= doesn't enable ISA extensions, just affects code-gen choices.)

#include <immintrin.h>
// #pragma GCC target ("arch=haswell")  // also broken, disables BMI2??

#pragma GCC target ("bmi2,tune=skylake") // this can be after immintrin.h

int foo(unsigned x) {
  // unsigned i = 0x6B5;                 // 11010110101
  unsigned i = x;
  unsigned n = 4;                        //    ^
  unsigned j = _pdep_u32(1u << n, i);
  int bitnum = __builtin_ctz(j);     // result is bit 7
  return bitnum;
}

int constprop() {  // check that GCC can still "understand" the intrinsic
  return foo(0x6B5);
}

compiles cleanly on Godbolt with only -O3, no -m options.

foo:
        mov     r8d, edi
        mov     edi, 16
        pdep    edi, edi, r8d
        bsf     eax, edi
        ret

constprop:
        mov     eax, 7
        ret

If you ever have trouble getting immintrin.h to define the right wrapper, you can look in GCC's headers to find out how they're defined. e.g. _pdep_u32 is a wrapper for __builtin_ia32_pdep_si (single integer) and u64 is a wrapper for __builtin_ia32_pdep_di (double integer).

These __builtin functions are always defined by GCC even with no headers, but these ia32 builtins can only be used in functions with compatible target options. There isn't a fallback like with __builtin_popcount for targets that don't support the instruction.

But it seems that for BMI2 instructions at least, _pdep_u32 and _pdep_u64 get defined by GCC's immintrin.h regardless of command line options (which would define a __BMI2__ CPP macro). Perhaps the reason is this use case: single functions with a target attribute, or a block of functions with a pragma.

The _pdep_u64 version requires BMI2 + x86-64 while the 32-bit version only requires BMI2 (i.e. is available in 32-bit mode as well). pdep is a scalar integer instruction so 64-bit operand-size requires 64-bit mode.


Aside: MSVC compiled the instrinsic without fuss

Yeah, MSVC and GCC/clang have different philosophies about intrinsics. MSVC lets you use whatever, under the assumption that you'll do runtime dispatching to make sure the execution never reaches an intrinsic that isn't supported on this machine.

This might be related to MSVC essentially not optimizing intrinsics at all, often not even doing simple constant-propagation through them. Because it's trying to make sure the corresponding asm instruction never executes when the intrinsic isn't reached in the C++ abstract machine, I guess, and takes the slow and safe approach.

but the program crashed (I suppose _pdep_u32 is not supported by my cpu as suggested in the linked question).

Yup, not all CPUs have BMI2. Nothing you can do with GCC will change that.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Thanks Peter I'll try out some of this. The GCC target is completely online, that is I post the source code and tell it to compile/run, but not how. – Weather Vane Jan 04 '20 at 13:49
  • That's excellent: a similar test program compiled and printed `7` as needed. The `#pragma GCC target ("bmi2,tune=skylake")` did it. So I'll go back to my profiling now and see if there is a significant speed-up. – Weather Vane Jan 04 '20 at 14:12
  • @WeatherVane: note that as I showed with the `constprop` function, the program doesn't contain a `pdep` instruction, so it will run on machines without BMI2. The inputs were both compile-time constants so GCC simply did the math at compile time (using some portable pure C implementation) and emitted asm that does `mov eax,7`. – Peter Cordes Jan 04 '20 at 14:20
  • I noticed the `mov eax, 7` but `foo` contains `pdep edi, edi, r8d`. Anyway, I am about to try it with a decent running time and a result the compiler shouldn't be able to guess. – Weather Vane Jan 04 '20 at 14:26
  • @WeatherVane: yeah, of course. Look at the source. `foo` takes a runtime variable input. `constprop` inlines it for a constant arg. On your machine that doesn't support BMI2, this will crash for cases where the compiler can't replace `pdep` with a constant. – Peter Cordes Jan 04 '20 at 14:31
  • Sadly, I get runtime SIGILL error, but many thanks for the advice, it was worth trying to discover it's not supported. – Weather Vane Jan 04 '20 at 14:58
  • @WeatherVane: you already discovered that with MSVC, but ok, sure. Still a potentially useful answer for future readers. – Peter Cordes Jan 04 '20 at 15:08
  • The MSVC and GCC codes were running on different computers: I needed to know if the target host supports `pdep` and at first it would not even compile, but with your help it did. – Weather Vane Jan 04 '20 at 15:37
  • @WeatherVane: Ah, that makes more sense. I wondered after I commented if that was the case. You could have tested with inline asm to see if it faulted or not, before trying to get GCC proper to emit it. – Peter Cordes Jan 04 '20 at 15:43