2

I would like to use the FMA intrinsic _mm256_fmadd_pd(a, b, c), but my code has to run on different computers, some with FMA support and some without. I cannot use a compile-time flag, so I would like to be able to write something like this:

__m256d a, b, c, d, x;
bool FMA_Enabled = CheckFMA();

if (FMA_Enabled)
{
  d = _mm256_fmadd_pd(a, b, c);
}
else
{
  x = _mm256_mul_pd(a, b);
  d = _mm256_add_pd(x, c);
}

I cannot find a way to write the function CheckFMA(). Is there a way to do this?

My OS is 64-bit Windows 10.

EDIT: The branch will actually sit outside of the function, so I won't lose performance by checking for FMA support on every call.

Cœur
Sylvestre
  • 2
    Are you saying that increased multiplication performance outperforms branching (which otherwise is not necessary)? Have you measured this? – freakish Nov 19 '19 at 10:49
  • I will use the branching outside of the function. – Sylvestre Nov 19 '19 at 10:50
  • So, you want your compiled binary to contain instructions that the CPU potentially doesn't even know? This feels very, very wrong. – lisyarus Nov 19 '19 at 10:53
  • 6
    @lisyarus No, that is not wrong. Every CPU has API for feature detection and so it doesn't have to enter invalid instructions. – freakish Nov 19 '19 at 10:54
  • My program will contain a check that ensures the unsupported instructions are never executed. The thing is, I have a variety of computers and I want a single program that runs on all of them. – Sylvestre Nov 19 '19 at 10:55
  • 2
    see https://stackoverflow.com/q/6121792/2747962 – Darklighter Nov 19 '19 at 11:00
  • 3
    Look at the [`__cpuid` Microsoft intrinsic](https://learn.microsoft.com/cs-cz/cpp/intrinsics/cpuid-cpuidex?view=vs-2019) and check for `AVX2` and `FMA` functions. – Daniel Langr Nov 19 '19 at 11:00
  • 1
    `__builtin_cpu_supports()` for gcc. – Shawn Nov 19 '19 at 11:07
  • 1
    @freakish Now that I think of it more, it starts to make sense, thank you. – lisyarus Nov 19 '19 at 11:15
  • 3
    @DanielsaysreinstateMonica: `_mm256_fmadd_pd` only requires the AVX and FMA3 feature bits, not AVX2. You don't want to exclude AMD Piledriver/Steamroller unnecessarily. Technically you need to check that the OS supports AVX (as well as the CPU), but a Windows program may be able to assume non-ancient Windows. Really you only need to check the FMA3 feature bit; it implies AVX because that's how it's encoded. (As opposed to AMD's abandoned FMA4 feature.) – Peter Cordes Nov 19 '19 at 11:17
  • 1
    Related: [AVX feature detection using SIGILL versus CPU probing](//stackoverflow.com/q/44144763) and [Which versions of Windows support/require which CPU multimedia extensions?](//stackoverflow.com/q/34069054) (An OS that supports AVX doesn't need to do anything special for user-space to use AVX2 and/or FMA, so checking their CPUID feature bits is sufficient) – Peter Cordes Nov 19 '19 at 11:25
  • 1
    Also, of course you wouldn't want to actually branch around 2 vs. 1 instruction on a value that's not known at compile time. And if you're using GCC or another compiler that will contract mul+add into FMA, be careful when you compile with FMA code-gen enabled that you don't get FMA on both paths. You probably want different whole functions with different target options/attributes. Or in MSVC, the compiler doesn't optimize intrinsics, so I think you just need /arch:AVX and you can use FMA inside functions that are only called on CPUs with FMA enabled. – Peter Cordes Nov 19 '19 at 11:28
  • @PeterCordes Of course I won't do the branching at the lowest level. I will do it before the whole function that performs the computation is called. I added an edit to my post to clarify this. – Sylvestre Nov 19 '19 at 12:25
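
The "different whole functions" approach from the comments above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names MulAddFma, MulAddAvx, MulAddFn and SelectMulAdd are hypothetical, AVX is assumed to be available on every target machine (as the question's use of _mm256_mul_pd implies), and the target attributes are the GCC/Clang way of enabling FMA for a single function without a global compile flag.

```cpp
#include <immintrin.h>
#include <cstddef>

// GCC/Clang need a per-function target attribute to use AVX/FMA intrinsics
// without global -mavx/-mfma flags. Other compilers (e.g. MSVC) get no attribute.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX     __attribute__((target("avx")))
#define TARGET_AVX_FMA __attribute__((target("avx,fma")))
#else
#define TARGET_AVX
#define TARGET_AVX_FMA
#endif

// d[i] = a[i] * b[i] + c[i] using fused multiply-add (hypothetical kernel).
TARGET_AVX_FMA
void MulAddFma(const double* a, const double* b, const double* c,
               double* d, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const __m256d va = _mm256_loadu_pd(a + i);
        const __m256d vb = _mm256_loadu_pd(b + i);
        const __m256d vc = _mm256_loadu_pd(c + i);
        _mm256_storeu_pd(d + i, _mm256_fmadd_pd(va, vb, vc));
    }
    for (; i < n; ++i) d[i] = a[i] * b[i] + c[i];  // scalar tail
}

// Same computation with a separate multiply and add. FMA is not enabled for
// this function, so the compiler cannot contract mul+add into an FMA here.
TARGET_AVX
void MulAddAvx(const double* a, const double* b, const double* c,
               double* d, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const __m256d va = _mm256_loadu_pd(a + i);
        const __m256d vb = _mm256_loadu_pd(b + i);
        const __m256d vc = _mm256_loadu_pd(c + i);
        _mm256_storeu_pd(d + i, _mm256_add_pd(_mm256_mul_pd(va, vb), vc));
    }
    for (; i < n; ++i) d[i] = a[i] * b[i] + c[i];  // scalar tail
}

using MulAddFn = void (*)(const double*, const double*, const double*,
                          double*, std::size_t);

// Pick the kernel once, from the result of a runtime check such as CheckFMA().
constexpr MulAddFn SelectMulAdd(bool fmaEnabled)
{
    return fmaEnabled ? &MulAddFma : &MulAddAvx;
}
```

At startup one would call something like `const MulAddFn mulAdd = SelectMulAdd(CheckFMA());` and then invoke `mulAdd(a, b, c, d, n)` in the hot path, so the feature check runs once instead of once per multiply-add.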

2 Answers

4

I wrote my function using __cpuid, adapting Microsoft's sample code. Thank you all very much for your help.

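#include <intrin.h>
#include <array>
#include <bitset>

// Returns true if the CPU reports FMA3 support (CPUID leaf 1, ECX bit 12).
bool CheckFMA()
{
    std::array<int, 4> cpui;

    // Leaf 0: EAX holds the highest supported basic CPUID leaf.
    __cpuid(cpui.data(), 0);
    const int nIds = cpui[0];

    if (nIds < 1)
    {
        return false;
    }

    // Leaf 1: the feature flags are returned in ECX and EDX.
    __cpuidex(cpui.data(), 1, 0);
    const std::bitset<32> ECX(static_cast<unsigned>(cpui[2]));

    return ECX[12];
}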
Sylvestre
  • 2
    You don't need to call `cpuid` in a loop to enumerate all possible outputs. You only need `__cpuidex(cpui.data(), 1, 0)` inside the `if(nIds_ >= 1)` for the one leaf you ever read. It's not a huge disaster for performance, e.g. [Ice Lake only has `0x1B`](http://users.atw.hu/instlatx64/GenuineIntel/GenuineIntel00706E5_IceLakeY_CPUID.txt) "basic" / low-numbered CPUID leaves to enumerate. So it's not going to make startup take extra milliseconds. – Peter Cordes Nov 20 '19 at 09:47
  • @PeterCordes. Yes, thank you. I will correct this in my post. – Sylvestre Nov 20 '19 at 12:34
  • 1
    Don't leave the clunky version in the answer, just show the good version. Or at *least* put it first. Edit to create the answer you should have posted in the first place; edit history is there if anyone wants to look. – Peter Cordes Nov 20 '19 at 12:52
  • 1
    BTW, in theory (if your code could possibly run under an ancient OS) you need to check that OS support for AVX is enabled. [Which versions of Windows support/require which CPU multimedia extensions?](//stackoverflow.com/q/34069054). The OS has to set a bit in the CPU to make them not fault. (This avoids the failure mode of corrupted AVX upper halves on context switches on OSes that don't know about that new architectural state.) – Peter Cordes Nov 24 '19 at 07:22
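
The OS-support check mentioned in the comment above could look roughly like the sketch below. It is an illustration, not a vetted implementation: the names HasBit, CpuBitsAllowFma, OsSavesYmm, ReadLeaf1Ecx, ReadXcr0 and CheckFmaUsable are hypothetical, and the non-MSVC branch assumes an x86 GCC or Clang toolchain. The idea is to require the FMA, AVX and OSXSAVE bits from CPUID leaf 1, then read XCR0 with XGETBV and confirm the OS has enabled both XMM and YMM state.

```cpp
#include <cstdint>

#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid
#include <immintrin.h>  // _xgetbv
#else
#include <cpuid.h>      // __get_cpuid (GCC/Clang, x86 only)
#endif

constexpr bool HasBit(std::uint64_t value, unsigned bit)
{
    return ((value >> bit) & 1u) != 0;
}

// CPUID leaf 1, ECX: bit 12 = FMA3, bit 27 = OSXSAVE, bit 28 = AVX.
constexpr bool CpuBitsAllowFma(std::uint32_t leaf1Ecx)
{
    return HasBit(leaf1Ecx, 12) && HasBit(leaf1Ecx, 27) && HasBit(leaf1Ecx, 28);
}

// XCR0: bit 1 = XMM state, bit 2 = YMM state. The OS must enable both.
constexpr bool OsSavesYmm(std::uint64_t xcr0)
{
    return (xcr0 & 0x6u) == 0x6u;
}

// Reads ECX of CPUID leaf 1. Returns false if leaf 1 is not available.
bool ReadLeaf1Ecx(std::uint32_t& ecx)
{
#if defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, 0);
    if (regs[0] < 1) return false;
    __cpuid(regs, 1);
    ecx = static_cast<std::uint32_t>(regs[2]);
    return true;
#else
    unsigned int eax = 0, ebx = 0, ecxOut = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecxOut, &edx)) return false;
    ecx = ecxOut;
    return true;
#endif
}

// Reads XCR0. Only call this after confirming that the OSXSAVE bit is set.
std::uint64_t ReadXcr0()
{
#if defined(_MSC_VER)
    return _xgetbv(0);
#else
    std::uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
#endif
}

// True only if the CPU has FMA3 and the OS preserves YMM registers.
bool CheckFmaUsable()
{
    std::uint32_t ecx = 0;
    if (!ReadLeaf1Ecx(ecx) || !CpuBitsAllowFma(ecx)) return false;
    return OsSavesYmm(ReadXcr0());
}
```

The bit decoding is kept in small constexpr helpers so it can be checked against known register values without depending on the machine the check runs on.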
1

Which OS? On Linux you could check /proc/cpuinfo for the relevant flag, e.g. fma.
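
A minimal sketch of that /proc/cpuinfo check, assuming an x86 Linux system where the file contains a line starting with "flags" followed by a colon and a space-separated list. The function names FlagsLineHas and CpuinfoHasFlag are hypothetical. Flags are compared as whole words so that fma4 is not mistaken for fma.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>

// Returns true if `line` is a "flags : ..." line that lists `flag`
// as a whole, space-separated word.
bool FlagsLineHas(const std::string& line, const std::string& flag)
{
    assert(!flag.empty());
    if (line.compare(0, 5, "flags") != 0) return false;

    const std::size_t colon = line.find(':');
    if (colon == std::string::npos) return false;

    std::istringstream words(line.substr(colon + 1));
    std::string word;
    while (words >> word) {
        if (word == flag) return true;
    }
    return false;
}

// Scans the cpuinfo file and reports whether any "flags" line lists `flag`.
// Returns false if the file cannot be opened (e.g. on a non-Linux system).
bool CpuinfoHasFlag(const std::string& flag,
                    const char* path = "/proc/cpuinfo")
{
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        if (FlagsLineHas(line, flag)) return true;
    }
    return false;
}
```

A caller would use `CpuinfoHasFlag("fma")` once at startup and cache the result.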

On Windows, take a look at Coreinfo (https://learn.microsoft.com/en-us/sysinternals/downloads/coreinfo), which uses the GetLogicalProcessorInformation function.

Marc Stroebel