
In C++, is there a way to query the number of lanes of the SIMD units, like this:

// 4 for Bulldozer, 
// 8 for Skylake, 
// 16 for Cascade Lake
int width = std::this_thread::SIMD_WIDTH; 

or does it have to be a non-portable code path? I would like to test some optimizations on a piece of code, but I guess keeping the tiling length (4/8/16/32) fixed for a vectorized loop is not a good idea.

By "natural" width, I mean the width that gives optimal performance. For example, Bulldozer has AVX capability by joining the FPUs of two cores together, but some operations run better with SSE4 there. Also, Cascade Lake can run SSE efficiently, but not as efficiently as AVX-512 (just a guess).

On a more "advanced" point, does this work for CPUs with different kinds of cores in the same package? For example, the newest Intel architecture has both efficiency cores and performance cores. What if I query the length as 256 bits (8 lanes for 32-bit fp) and the OS schedules the thread onto a different core that has wider SIMD?

huseyin tugrul buyukisik
  • Standard C++ knows nothing about SIMD. You'll need to build/find a library that can get that information. – NathanOliver Apr 01 '22 at 11:55
  • Assuming an x86-based CPU (including x86-64), you can use the `cpuid` instruction (or a compiler-supplied intrinsic function) to get information about the CPU and from that make the deductions needed. – Some programmer dude Apr 01 '22 at 11:58
  • @NathanOliver there is experimental support for SIMD: https://en.cppreference.com/w/cpp/header/experimental/simd -- but I think this only allows compile-time detection of the optimal size: https://godbolt.org/z/W38G18xxz – chtz Apr 01 '22 at 12:11
  • @chtz will this experimental header become official later? Looks good. (It would be better if it had a run-time query too, in case of asymmetric CPUs.) – huseyin tugrul buyukisik Apr 01 '22 at 12:12
  • Virtually no one is prepared for asymmetric CPUs yet. – Alex Guteniev Apr 01 '22 at 14:07
  • *bulldozer has avx capability by joining two cores FPU together* - That's an extreme distortion of two different aspects of AMD's Bulldozer-family design. First, 256-bit operations decode to two 128-bit uops, just like Zen 1. (And like Intel did for 128-bit operations as 64-bit halves for SSE on early P6-family, Pentium 3 / M until Core 2.) Second, a pair of weak integer cores share a single group of SIMD execution units, unlike in Zen where each core is separate but supports SMT (so a physical core can act as two logical cores), another way of exploiting thread-level parallelism. – Peter Cordes Apr 02 '22 at 00:20
  • Depending on the CPU and workload, using 256-bit AVX instructions on Zen 1 or an Alder Lake E-core can still be efficient. (Especially if you aren't having to shuffle across lanes much.) Especially on Zen 1, where the front-end is 5 instructions or 6 uops wide, so max uop throughput is only available with some multi-uop instructions, at least that's my understanding. – Peter Cordes Apr 02 '22 at 00:24
  • So deciding when to use less than the max HW-supported width is a matter of tuning, not just detecting capabilities. (e.g. for compile-time choice, gcc/clang have `-mprefer-vector-width=256` implied by `-march=skylake-avx512`.) Partly that's [SIMD instructions lowering CPU frequency](https://stackoverflow.com/q/56852812) - using 512-bit instructions automatically in a program that doesn't spend a lot of its time in SIMD loops isn't a good idea. But if it does, it can be worth it. – Peter Cordes Apr 02 '22 at 00:25

0 Answers