3

C++20 introduces many new functions such as std::popcount, I use the same functionality using an Intel Intrinsic.

I compiled both options - can be seen in Compiler Explorer code:

  1. Using Intel's AVX2 intrinsic
  2. Using std::popcount and GCC compiler flag "-mavx2"

It looks like the generated assembly code is the same, besides of the type checks used in std's template.

In terms of OS agnostic code and having the same optimizations - Is it right to assume that using std::popcount and the apt compiler vector optimization flags is better than directly using intrinsics?

Thanks.

AndyG
  • 39,700
  • 8
  • 109
  • 143
joepol
  • 744
  • 6
  • 22
  • Note that the `popcnt` instruction is not part of AVX2. It is part of the `POPCNT` instruction extension, and has been available on all Intel processors since the Nehalem architecture, well before AVX2 was introduced. – Jason R Jan 05 '21 at 14:57
  • You can check the generated assembly. For example, GCC generated `popcnt` with `-march=native` and `-O2` in my case: https://godbolt.org/z/x3e7Wa. (Just don't understand why `eax` is zeroed first; `popcnt` should replace the whole content of `rax`, or shouldn't?) – Daniel Langr Jan 05 '21 at 15:27
  • If you check some implementations, the code in [libstdc++](https://github.com/gcc-mirror/gcc/blob/6b577a17b26347e78c8b9167f24fc5c9d9724270/libstdc%2B%2B-v3/include/std/bit#L190) aka the STL from GCC AND the code in [libc++](https://github.com/llvm-mirror/libcxx/blob/78d6a7767ed57b50122a161b91f59f19c9bd0d19/include/bit#L317) aka the STL from LLVM/Clang, you see they don't call the intrinsics directly. – JVApen Jan 05 '21 at 18:50
  • 2
    @DanielLangr [popcnt has a false dependency](https://stackoverflow.com/questions/21390165/why-does-breaking-the-output-dependency-of-lzcnt-matter) on the destination register. – aqrit Jan 05 '21 at 20:48
  • @aqrit Thanks for that link; I wasn't aware of this problem and it's quite interesting. – Daniel Langr Jan 06 '21 at 05:34

1 Answers1

6

Technically No. (But practically, yes). The C++ standard only specifies the behavior of popcount, and not the implementation (Refer to [bit.count]).

Implementors are allowed to do whatever they want to achieve this behavior, including using the popcnt intrinsic, but they could also write a while loop:

int set_bits = 0;
while(x)
{
   if (x & 1)
      ++set_bits;
   x >>= 1;
}
return set_bits;

This is the entire wording in the standard at [bit.count]:

template<class T>
constexpr int popcount(T x) noexcept;

Constraints: T is an unsigned integer type ([basic.fundamental]).
Returns: The number of 1 bits in the value of x.

Realistically? Compiler writers are very smart and will optimize this to use intrinsics as much as possible. For example, gcc's implementation appears to be fairly heavily optimized.

AndyG
  • 39,700
  • 8
  • 109
  • 143
  • I am confused. You start with **No**, but then the rest of your post basically concludes **Yes**. I mean, under the C++ standard, `int x = 7+3;` is legally allowed to be implemented as a for loop in assembly, but you don't see people saying to use intrinsics or assembly to add integers, do you? – Yakk - Adam Nevraumont Jan 05 '21 at 19:19
  • @Yakk-AdamNevraumont: I think you know even better than I do that the answer is "Technically no, but practically yes" – AndyG Jan 05 '21 at 19:21
  • 1
    I'm just saying that maybe you should start with **Yes** rather than **No** ;) Someone might get misled. – Yakk - Adam Nevraumont Jan 05 '21 at 19:21
  • Ugh, the popcount `if(__x == 0) return 0;` compiles badly with clang when hardware `popcnt` is available, actually checking FLAGS from popcnt and doing a useless cmov. https://godbolt.org/z/KqjPe3. I guess it might be worth branching around GCC's fallback bithack, but gcc still uses cmov there! – Peter Cordes Jan 06 '21 at 05:47
  • (Also, glibc's implementation of `__has_single_bit` as `popcount(x) == 1` right after that in the source is also questionable, although I guess it needs to reject `0` so after BMI1 `blsr` you'd have to check both ZF==1 (result = 0) and CF==0 (input was non-zero). `setnbe` checks for ZF==0 and CF==0. So not quite. even with cmc, `setbe` checks ZF==1 || CF==1, not and. Anyway, clang even turns my pure-C attempt at that back into popcnt / cmp (but without the stupid `__x == 0` part)) – Peter Cordes Jan 06 '21 at 05:51
  • 1
    libc++ doesn't include that `__x==0` which makes things worse for compilers; clang `-stdlib=libc++` is good https://godbolt.org/z/8eo64c (but doesn't try to avoid the popcnt output false dependency that exists on SnB through Skylake, not fixed until Ice Lake. clang is typically reckless with false dependencies, which does save instructions in cases where they don't create a problem.) I assume the `__x==0` check would also defeat clang's auto-vectorization of `__builtin_popcnt` (over arrays when AVX2 is available.) – Peter Cordes Jan 06 '21 at 05:58