
According to this link, there are no predefined preprocessor symbols for AVX512 (MSVC 2017).

I'm trying to build thundersvm, which uses the Eigen library, on (you guessed it) Windows. Both Eigen and thundersvm use cmake, and depending on the compiler's preprocessor symbols, Eigen compiles with or without AVX512 instructions.

It seems that using `/arch:AVX512` doesn't trigger any errors in MSVC, but it doesn't define the `__AVX512F__` symbol which Eigen needs. I also tried adding `-D__AVX512F__=ON` to the cmake arguments, but still no luck.
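One way to confirm what the compiler actually predefines is a small check like the one below. It is only a sketch: the function name `enabled_simd_macros` is made up for illustration, and it inspects just three of the symbols mentioned here. Note also that `-D` on the cmake command line creates a CMake cache entry; it does not become a compiler define unless the project's own CMake files forward it to the compiler, which may be why `-D__AVX512F__=ON` had no effect.

```cpp
#include <string>

// Hypothetical helper: returns a comma-separated list of the SIMD feature
// macros that the compiler predefined for this translation unit, or "none".
std::string enabled_simd_macros() {
    std::string out;
    auto add = [&out](const char* name) {
        if (!out.empty()) out += ", ";
        out += name;
    };
#if defined(__AVX__)
    add("__AVX__");
#endif
#if defined(__AVX2__)
    add("__AVX2__");
#endif
#if defined(__AVX512F__)
    add("__AVX512F__");
#endif
    return out.empty() ? "none" : out;
}
```

Printing the result of `enabled_simd_macros()` from a file compiled with the same flags as the real project shows whether the `/arch` option reached the compiler and which symbols it defined.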

Since there is no predefined preprocessor symbol for AVX512, is there any way to force Eigen to compile with AVX512?


Update

Following chtz's comment, I checked out the default branch of Eigen and recompiled thundersvm with `/arch:AVX512` and these cmake arguments (maybe not all are needed):

-DUSE_CUDA=OFF -DUSE_EIGEN=ON -DBUILD_SHARED_LIBS=OFF -DEIGEN_ENABLE_AVX512=ON -D__AVX512F__=ON -DEIGEN_VECTORIZE_AVX512=ON -DEIGEN_VECTORIZE_AVX2=ON -DEIGEN_VECTORIZE_AVX=ON -DEIGEN_VECTORIZE_FMA=ON

Comparing the instruction mix from Intel's SDE -mix tool before and after the patch, I can clearly see that AVX instructions are used (SDE complains that it doesn't recognise the instruction `vbroadcastss zmm0, xmm0` when running for an skl CPU, but works fine for skx). The problem is that MSVC uses the scalar versions of the AVX instructions and there is no improvement in the runtime (the total number of instructions is also the same), which is similar to this post.

Are there other flags I need to define so that MSVC generates non-scalar instructions? (I think I'll also give gcc a try.)

Peter Cordes
user1934513
  • Worth mentioning that MSVC's AVX512 support has been virtually unusably buggy until very recently. So I wouldn't rule out the possibility that the VS team intentionally left out a flag while the "beta testers" sorted out all the issues. I've personally found at least 10 AVX512-related bugs. – Mysticial Feb 14 '19 at 20:00
  • @Mysticial Is there a way to become a beta tester? – user1934513 Feb 14 '19 at 20:44
  • I meant it only figuratively. VS announced support for AVX512 back in July 2017. But it wasn't really usable until ~November of last year. So everybody who tried to use it during that period was essentially a "beta tester", usually debugging the compiler's bugs more than their own. – Mysticial Feb 14 '19 at 21:10
  • @user1934513: MSVC is *not* the only compiler that can make Windows binaries. clang or gcc are free and usually make efficient code. Use `-O3 -march=native -ffast-math`, and maybe `-mprefer-vector-width=512` if your program will spend most of its time in code that can benefit from wide vectors. (The default when tuning for Skylake-X is 256-bit vectors, to avoid gimping max turbo for the whole program when auto-vectorizing a random loop that's actually not important.) – Peter Cordes Feb 15 '19 at 02:02
  • `/arch:AVX512` should do the job, but we need to fix a relatively simple issue in Eigen first: http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1678 Should be done in the next few days. – chtz Feb 15 '19 at 08:59
  • Could you try the most recent head version of Eigen (and compile with `/arch:AVX512`)? Gael just pushed two fixes which should make this work now. – chtz Feb 15 '19 at 13:10
  • @chtz I recompiled using the patch you suggested and things did change, but not from a performance point of view. It seems to be the same issue we discussed in the previous question you answered. See my edit. – user1934513 Feb 15 '19 at 15:55
  • `vbroadcastss zmm0, xmm0` is an AVX512F instruction (so of course SKL = Skylake-client doesn't support it). Clearly MSVC is using some vector instructions. I assume you meant that it's *mostly* using scalar, not able to auto-vectorize the important loops that your code uses, only unimportant ones, if you're still testing the same SVM benchmark. And BTW, do you mean you're getting scalar AVX512F instructions? Or is it still using scalar AVX instructions? (That's more likely; it's only a win to use the longer EVEX encoding to access more registers.) – Peter Cordes Feb 15 '19 at 16:05
  • @PeterCordes I did indeed mean that it is using scalar AVX instructions and is not able to auto-vectorize. – user1934513 Feb 15 '19 at 16:12
  • Can you try to benchmark a simple program which should clearly profit from AVX512F, e.g., just multiply two large matrices? (For too large matrices, caching might limit the effect, though.) Eigen clearly uses AVX512 for GEMM even with MSVC: https://godbolt.org/z/6BD80z (Note: MSVC creates a lot of noise as it generates assembly for many unused inline functions. Also, the manual fixes in that link might break compilation as soon as godbolt updates its Eigen library.) – chtz Feb 15 '19 at 16:34
  • @chtz: a well-optimized matmul should bottleneck on FMA throughput, not memory bandwidth, even for very large sizes. Classic use-case for loop tiling (aka cache blocking), with O(N^3) computation over N^2 memory for dense square matrices. Or depending on how optimization is done, 2x load + 1x store per FMA will bottleneck on L1d cache bandwidth and store uop throughput, but still benefit by a factor of 2 from wider vectors. (Or less because of reduced turbo clocks). I think there are techniques that can do 2-per-clock FMA, not looping over output elements doing `dst[i] += stuff` – Peter Cordes Feb 15 '19 at 17:03
  • @PeterCordes Last time I checked, AVX is still unusable on Windows with GCC. They still haven't fixed the stack-alignment bug. But supposedly Clang has. – Mysticial Feb 15 '19 at 17:56
  • @chtz I ran the benchmark you suggested, and in the generated binary 6 billion out of 8 billion total instructions are categorized as avx256 and avx128 by the SDE tool (I used `/arch:AVX2` for now). So I believe the issue I see is caused by the cmake files used by the thundersvm project. – user1934513 Feb 18 '19 at 16:41
  • I never used/compiled thundersvm. It could also be that they barely use operations which would benefit from AVX512 instead of AVX256. E.g., at the moment when multiplying `4x4 double` matrices Eigen should use AVX256 and not AVX512 (with some swizzling, AVX512 might be better, but AVX512 support inside Eigen is also far from perfect) – chtz Feb 18 '19 at 17:18
  • `/arch:AVX512` [does define `__AVX512F__`](https://learn.microsoft.com/en-us/cpp/build/reference/arch-x64?view=msvc-160) and has for some time now. https://godbolt.org/z/vvj7oT – phuclv Nov 24 '20 at 06:53

1 Answer


MSVC has poor support for AVX-512 and no distinction between the different subsets. There is no safe way to produce AVX512F code on MSVC without also possibly making AVX512DQ instructions.

The best compilers for AVX-512 are gcc and clang. There is a Clang plugin to Visual Studio that you can use if you like the IDE. The gcc and clang compilers have preprocessor symbols like __AVX512F__, __AVX512VL__, etc.
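As a rough sketch of how such a symbol is typically used, the function below takes a 512-bit path only when `__AVX512F__` is defined (for example with `-mavx512f` or `-march=native` on a suitable CPU under gcc or clang) and otherwise falls back to a scalar loop. The name `sum_floats` is hypothetical and not part of Eigen or thundersvm.

```cpp
#include <cstddef>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical example: sums n floats. The vector path is compiled in only
// when the compiler defines __AVX512F__; otherwise only the scalar loop runs.
float sum_floats(const float* data, std::size_t n) {
    float total = 0.0f;
    std::size_t i = 0;
#if defined(__AVX512F__)
    // Accumulate 16 floats per iteration in one 512-bit register.
    __m512 acc = _mm512_setzero_ps();
    for (; i + 16 <= n; i += 16) {
        acc = _mm512_add_ps(acc, _mm512_loadu_ps(data + i));
    }
    // Spill the 16 lanes and add them up horizontally.
    float lanes[16];
    _mm512_storeu_ps(lanes, acc);
    for (float lane : lanes) total += lane;
#endif
    // Scalar loop: handles the tail, or everything when AVX-512F is off.
    for (; i < n; ++i) total += data[i];
    return total;
}
```

A binary built with the vector path enabled must only run on CPUs that support AVX-512F, so real projects often pair a compile-time guard like this with a runtime CPU check.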

phuclv
A Fog