
I'm trying to determine whether the _mm512_mullox_epi64 sequence intrinsic (AVX-512 Foundation; emulated with a sequence of instructions) is substantially slower than the hardware-implemented _mm512_mullo_epi64 intrinsic (AVX-512 Double-Word and Quad-Word, i.e. AVX-512DQ, which compiles to a single vpmullq).

The _mm512_mullo_epi64 intrinsic will raise an illegal-instruction fault on hardware with AVX-512 but without the DWQW (AVX-512DQ) instruction-set extension.
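For what it's worth, this is roughly how I plan to guard against that at runtime. It's only a sketch, assuming GCC/Clang (`__builtin_cpu_supports` and the `target` function attribute aren't available in MSVC), and `mul_loop_dq` / `mul_loop_f` / `mul_dispatch` are just names I made up here:

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Each loop carries its own target attribute, so the file can be built without
// enabling AVX-512DQ (or AVX-512F) globally.
__attribute__((target("avx512dq")))
void mul_loop_dq(const int64_t* a, const int64_t* b, int64_t* out, size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8)
        _mm512_storeu_si512(out + i,
            _mm512_mullo_epi64(_mm512_loadu_si512(a + i), _mm512_loadu_si512(b + i)));
}

__attribute__((target("avx512f")))
void mul_loop_f(const int64_t* a, const int64_t* b, int64_t* out, size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8)
        _mm512_storeu_si512(out + i,
            _mm512_mullox_epi64(_mm512_loadu_si512(a + i), _mm512_loadu_si512(b + i)));
}

// Only take the vpmullq path when the CPU actually reports AVX-512DQ.
void mul_dispatch(const int64_t* a, const int64_t* b, int64_t* out, size_t n)
{
    if (__builtin_cpu_supports("avx512dq"))
        mul_loop_dq(a, b, out, n);
    else if (__builtin_cpu_supports("avx512f"))
        mul_loop_f(a, b, out, n);
    // else: a scalar or AVX2 fallback would go here
}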

I don't have an AVX-512-capable CPU, and trying to benchmark on Godbolt gives very inconsistent results. My code also doesn't compile on Quick Bench, because you can't currently pass compiler options like -mavx512dq there.

I'm also interested in whether there is a good option for AVX2, since there is no AVX2 intrinsic for multiplying packed 64-bit integers.

Using _mm256_mul_pd with a cast often produces incorrect results when the product fits in an int64_t but is too large to be represented exactly in a 64-bit double (53-bit significand).
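Here's a scalar illustration of that rounding problem (the numbers are just ones I picked; the same thing happens per lane with `_mm256_mul_pd`):

#include <cstdint>
#include <cstdio>

int main()
{
    // Both factors and their exact product fit comfortably in int64_t,
    // but the product needs more than the 53 significand bits of a double.
    const int64_t a = 123456789;
    const int64_t b = 987654321;

    const int64_t exact = a * b;  // 121932631112635269
    const int64_t via_double =
        static_cast<int64_t>(static_cast<double>(a) * static_cast<double>(b));

    // The double round trip comes back with the low bits rounded away.
    std::printf("exact      = %lld\nvia double = %lld\n",
                static_cast<long long>(exact), static_cast<long long>(via_double));
}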

Here's my test code if you're interested:

 #include "immintrin.h"
 #include <cstdint>
 #include <array>
 #include <algorithm>
 #include <numeric>
 #include <iostream>
 #include <chrono>

 std::array<int64_t, 1000000> arr1;
 std::array<int64_t, 1000000> arr2;
 std::array<int64_t, 1000000> arr3;

 class Timer
{
public:
    Timer()
    {
        start = std::chrono::high_resolution_clock::now();
    }//End of constructor

    Timer(Timer const&) = delete;
    Timer& operator=(Timer const&) = delete;
    Timer(Timer&&) = delete;
    Timer& operator=(Timer&&) = delete;

    ~Timer()
    {
        end = std::chrono::high_resolution_clock::now();
        std::chrono::high_resolution_clock::duration d = end - start;
        std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(d).count() << "ns\n";
    }//End of destructor

private:
    std::chrono::high_resolution_clock::time_point start;
    std::chrono::high_resolution_clock::time_point end;
};//End of class Timer

template<uint64_t SIZE1, uint64_t SIZE2, uint64_t SIZE3>
void mul_f(const std::array<int64_t, SIZE1>& src, const std::array<int64_t, SIZE2>& src2, std::array<int64_t, SIZE3>& dest)
{
    // AVX-512F only: _mm512_mullox_epi64 is a sequence intrinsic, so the
    // compiler emits several instructions per vector multiply.
    for (uint64_t i = 0; i + 8 <= SIZE3; i += 8)
    {
        const __m512i _src1 = _mm512_load_epi64(&src[i]);
        const __m512i _src2 = _mm512_load_epi64(&src2[i]);

        const __m512i _dest = _mm512_mullox_epi64(_src1, _src2);

        _mm512_store_epi64(&dest[i], _dest);
    }
}

template<uint64_t SIZE1, uint64_t SIZE2, uint64_t SIZE3>
void mul_dq(const std::array<int64_t, SIZE1>& src, const std::array<int64_t, SIZE2>& src2, std::array<int64_t, SIZE3>& dest)
{
    // AVX-512DQ: _mm512_mullo_epi64 maps to a single vpmullq per vector.
    for (uint64_t i = 0; i + 8 <= SIZE3; i += 8)
    {
        const __m512i _src1 = _mm512_load_epi64(&src[i]);
        const __m512i _src2 = _mm512_load_epi64(&src2[i]);

        const __m512i _dest = _mm512_mullo_epi64(_src1, _src2);

        _mm512_store_epi64(&dest[i], _dest);
    }
}

template<uint64_t SIZE1, uint64_t SIZE2, uint64_t SIZE3>
void mul_avx2(const std::array<int64_t, SIZE1>& src, const std::array<int64_t, SIZE2>& src2, std::array<int64_t, SIZE3>& dest)
{
    // AVX2 has no 64-bit low-multiply instruction, so each lane is pulled out,
    // multiplied with scalar code, and written back through a temporary array.
    for (uint64_t i = 0; i + 4 <= SIZE3; i += 4)
    {
        const __m256i _src1 = _mm256_load_si256((const __m256i*)&src[i]);
        const __m256i _src2 = _mm256_load_si256((const __m256i*)&src2[i]);

        alignas(32) int64_t d[4] = {};  // 32-byte aligned for the aligned reload below
        for (size_t x = 0; x != 4; ++x)
        {
#ifdef _WIN32
            d[x] = _src1.m256i_i64[x] * _src2.m256i_i64[x];
#else
            d[x] = _src1[x] * _src2[x];
#endif
        }//End for

        const __m256i _dest = _mm256_load_si256((const __m256i*)&d);

        _mm256_store_si256((__m256i*)&dest[i], _dest);
    }
}



int main()
{
    std::iota(arr1.begin(), arr1.end(), 5);
    std::iota(arr2.begin(), arr2.end(), 2);

    {
        // Named object: a bare "Timer();" is a temporary that is destroyed
        // immediately, so it would time nothing.
        Timer t;
        mul_f(arr1, arr2, arr3);
    }

    {
        Timer t;
        mul_dq(arr1, arr2, arr3);
    }

    {
        Timer t;
        mul_avx2(arr1, arr2, arr3);
    }

    return static_cast<int>(arr3[0]);
}
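I build it roughly like this (assuming GCC; `-march=skylake-avx512` enables AVX2, AVX-512F, and AVX-512DQ together, and `bench.cpp` is just a placeholder file name):

g++ -O3 -march=skylake-avx512 -o bench bench.cpp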

Thanks in advance for your assistance.

dave_thenerd
  • *hardware with AVX-512 but without the DWQW* - Just for the record, the only AVX-512 CPUs (so far) without AVX-512DQ are the discontinued Xeon Phi cards. (https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512). I'd guess that if AMD ever does implement AVX-512, they'd include AVX-512DQ like Intel has on their mainstream CPUs. You can predict the performance with static analysis tools like LLVM-MCA, or with https://uops.info/table.html. (The 64-bit element size `vpmullq` takes 3 uops per instruction on current Intel, vs. 2 for `vpmulld` or 1 for `vpmuludq`.) – Peter Cordes Aug 09 '21 at 03:39
  • `mullox` is implemented using 32-bit multiplies and "schoolbook" long multiplication, so it does seem like it'd be considerably slower: https://godbolt.org/z/We88796Kn – Nate Eldredge Aug 09 '21 at 03:42
  • Also related: you can do extended-precision stuff using the `double`-precision mantissa multipliers with clever bit-manipulation of FP bit patterns, and FMAs. But if 64-bit precision is enough but one `double` (52/53-bit) isn't, then yeah 64-bit integer might be the way to go, even though it's inconvenient and slow to multiply. – Peter Cordes Aug 09 '21 at 03:50
  • [Fastest way to multiply an array of int64_t?](https://stackoverflow.com/q/37296289) has a manually-vectorized AVX2 version that should be significantly better than extracting to scalar, especially between other vector operations. – Peter Cordes Aug 09 '21 at 03:51 *(a sketch of that approach follows these comments)*
  • OK, thanks guys! I'm not sure if I'll upgrade my code to use _mm512_mullo_epi64 yet, though; Xeon Phi is still pretty new, and if it were me I'd want my $2000+ CPU to not be obsolete after 5 years. – dave_thenerd Aug 14 '21 at 06:14
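
For reference, a sketch of the schoolbook AVX2 approach the comments above describe: split each 64-bit lane into 32-bit halves and combine the partial products, which keeps everything in vector registers instead of bouncing through scalar stores. This mirrors the idea in the linked answer rather than reproducing it, and `mul64_avx2` is just an illustrative name:

#include <immintrin.h>

// Low 64 bits of a 64x64-bit multiply in each lane, using only AVX2.
// Schoolbook identity: a*b mod 2^64 = a_lo*b_lo + ((a_lo*b_hi + a_hi*b_lo) << 32).
__m256i mul64_avx2(__m256i a, __m256i b)
{
    const __m256i a_hi  = _mm256_srli_epi64(a, 32);   // high 32 bits of each a lane
    const __m256i b_hi  = _mm256_srli_epi64(b, 32);   // high 32 bits of each b lane

    const __m256i lo_lo = _mm256_mul_epu32(a, b);      // a_lo * b_lo (full 64-bit product)
    const __m256i lo_hi = _mm256_mul_epu32(a, b_hi);   // a_lo * b_hi
    const __m256i hi_lo = _mm256_mul_epu32(a_hi, b);   // a_hi * b_lo

    __m256i cross = _mm256_add_epi64(lo_hi, hi_lo);    // sum of the cross terms
    cross = _mm256_slli_epi64(cross, 32);               // they contribute to the upper half

    return _mm256_add_epi64(lo_lo, cross);              // low 64 bits of each product
}

Signedness doesn't matter for the low 64 bits, so this works for int64_t as well as uint64_t.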

0 Answers