15

The Haswell architectures comes up with several new instructions. One of them is PEXT (parallel bits extract) whose functionality is explained by this image (source here):

pext

It takes a value r2 and a mask r3 and puts the extracted bits of r2 into r1.

My question is the following: what would be the equivalent code of an optimized templated function in pure standard C++11, that would be likely to be optimized to this instruction by compilers in the future.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Vincent
  • 57,703
  • 61
  • 205
  • 388
  • 1
    I don't think compilers will ever try to work out if a function is doing exactly that and compile it to a single instruction. If there was some built-in way to do it, sure, but everybody will have different implementations. So I'm not sure your question is answerable as it is. – Joseph Mansfield Jan 15 '14 at 17:32
  • 1
    I can't think of a reason why you would need to do that. And why a template? – graham.reeds Jan 15 '14 at 17:34
  • 2
    Wish I could downvote comments. Graham, just because you don't have a need to twiddle bits around doesn't invalidate the question. When you are writing low-level code (compression? graphics?), fast bit juggling can be hugely valuable. – StilesCrisis Jan 15 '14 at 20:25
  • 1
    Duplicate of this question in disguise? http://stackoverflow.com/questions/21141848/ Is this still meant for a supercomputer doing this billions of iterations? – user541686 Jan 16 '14 at 10:50
  • @Mehrdad your comment is invalid. Vincent, if you want this instruction, you could use it directly via inline assembly, instead of working around it in C++. this way, you can be certain that the compiler does what you want, and it should be much clearer to read. what compiler are you using? – Andreas Grapentin Jan 16 '14 at 10:56
  • Some people (like me) need to support multiple computing architectures with their bit fiddling code... For people like us, it would really be nice to be able know the coding pattern that a compiler would automatically recognize as corresponding to pext... Especially if that coding pattern would be equally recognizable for optimization in other architectures than x86_64. – Michael Back Mar 29 '21 at 10:46

2 Answers2

5

Here is some code from Matthew Fioravante's stdcxx-bitops GitHub repo that was floated to the std-proposals mailinglist as a preliminary proposal to add a constexpr bitwise operations library for C++.

#ifndef HAS_CXX14_CONSTEXPR
#define HAS_CXX14_CONSTEXPR 0
#endif

#if HAS_CXX14_CONSTEXPR
#define constexpr14 constexpr
#else
#define constexpr14
#endif

//Parallel Bits Extract
//x    HGFEDCBA
//mask 01100100
//res  00000GFC
//x86_64 BMI2: PEXT
template <typename Integral>
constexpr14 Integral extract_bits(Integral x, Integral mask) {
  Integral res = 0;
  for(Integral bb = 1; mask != 0; bb += bb) {
    if(x & mask & -mask) {
      res |= bb;
    }
    mask &= (mask - 1);
  }
  return res;
}
TemplateRex
  • 69,038
  • 19
  • 164
  • 304
  • This is interesting if it optimizes to pext on x86-64... Does it also optimize to "something cool" on other architectures (my main application is morton decoding). Also, does this optimize as well in C on gcc or clang (without the use of templates)? – Michael Back Mar 29 '21 at 10:50
1

As of August 2022, there still does not seem to be a way to write code for which a compiler will generate a PEXT instruction.

However, at this point (C++20), you can call many functions in <bit> that wrap assembly code.

In general, you can get a compiler to generate assembly instructions, where supported, for all functions in TBM/BMI and for BZHI from BMI2, which all can be described with simple expressions.

PDEP is non-trivial and has the same complexity as PDEP, with multiple possible implementations. So, neither seem straightforward for an optimizer to recognize.

The ABM instructions, POPCNT and LZCNTare also non-trivial, and implementations could not be recognized. Fortunately, we have std::popcount and std::countl_zero, respectively, which can map directly to the corresponding instruction (where the hardware supports it).

It would seem that PEXT and PDEP will likely be similarly supported before the time compilers can infer that an instruction can replace the algorithm.

Now that C++ is officially two's compliment, it would be nice to see both arithmetic and logical shift right wrapped so that one could explicitly use them on signed or unsigned, as required.

As far as PEXT implementations go, here's a variation that might compare favorably to TemplateRex's (Matthew Fioravante's) version at https://stackoverflow.com/a/21159523/2963099

template <typename Integral>
constexpr Integral extract_bits(Integral x, Integral mask) {
   Integral res=0;
   int bb=1;

   do {
        Integral lsb=mask & -mask;
        mask &= ~lsb;
        bool isset=x & lsb;
        res |= isset ? bb : 0;
        bb+=bb;
    } while (mask);
        
  return res;
}

You can compare them both on Compiler Explorer at https://godbolt.org/z/3h9WrYqxT

Mostly this breaks out the least significant byte (lsb) to remove it from mask, and test against x

It is safe to run through once with mask === 0. (lsb will be 0, so isset is false). Using a do while is much more efficient except for that trivial case.

Using the ternary operator is mostly stylist since, to me, it is a stronger hint of the intent to generate a cmove

Glenn Teitelbaum
  • 10,108
  • 3
  • 36
  • 80
  • Ok, blsi / test / cmove in a loop might be ok to emulate `pext` on machines that don't have it, perhaps even on AMD before Zen3 where `pext` is very slow (perhaps using microcode like this). But current compilers *don't* recognize as an idiom for `pext`, which was the goal. (Like how some compilers are able to recognize certain loops and compile them to a popcnt instruction). `pext` is 1 uop with 3-cycle latency on Intel CPUs that support it, and Zen3, so it's much faster than the average or worst case of this, and quite a bit better than even the best case of this. – Peter Cordes Aug 16 '22 at 23:22
  • Yes, there ready should be PEXT and PDEP library support because user code versions are much less efficient. That's what drew me to this lightly aged thread, I was curious if it is recognized PEXT now. Generally TBM/BMI is discovered and substituted. BMI2 seems beyond the current scope of the optimizers. – Glenn Teitelbaum Aug 17 '22 at 01:11
  • Is there idiom recognition for loops or bithack sequences equivalent to BMI1 lzcnt/tznct? It's not by instruction set, it's by instruction simplicity. e.g. two booleans can easily get recognized as BMI1 blsi or blsr or andn. And hopefully BMI2 bzhi, at least for C that covers all its corner cases. And of course `shlx` is trivial, as is `mulx` when a widening multiply is desired if its register usage is more convenient than `mul`. And recognizing rotate idioms for `rorx` is a solved problem, so the choice of rorx over rol/ror is just down to wanting it non-destructive. – Peter Cordes Aug 17 '22 at 01:40
  • 1
    Anyway, I commented because the question title asks for something that a compiler can recognize. If that part is still unsolved, it would be good to mention that. – Peter Cordes Aug 17 '22 at 01:41
  • Thanks, let me know if anything else is missing. – Glenn Teitelbaum Aug 17 '22 at 14:21
  • 1
    That's good discussion, but you're still talking like all BMI2 instructions are complicated; BZHI is somewhat hard to express with pure C, but doesn't require looping. And other than it and PDEP/PEXT, the rest are just no-flags shifts / multiply. https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set#BMI2_(Bit_Manipulation_Instruction_Set_2). – Peter Cordes Aug 17 '22 at 17:40
  • Thanks again, also cleaned up a bit. – Glenn Teitelbaum Aug 18 '22 at 17:15
  • 2
    Gcc does recognize some popcount loop patterns (see `number_of_iterations_popcount`). – Marc Glisse Aug 18 '22 at 17:56
  • btw, I posted to codereview and this came up: https://codereview.meta.stackexchange.com/questions/10829/how-much-modification-is-needed-to-be-considered-authoring-the-code – Glenn Teitelbaum Aug 19 '22 at 13:11