1

Given two same size vectors of uint64. I simply wants to compute the dot-product between these vectors. As it will require 128 bit arithmetic to store the intermediate results. I come up with a simple function

void multiply_acum(uint64_t op1, uint64_t op2, uint128_t& product_acum) {
     product_acum =  product_acum + (static_cast<uint128_t>(op1) * static_cast<uint128_t>(op2)); }

and call this function from dot-product function (may not be working providing just for reference)

     void dot_product(int size, vector<uint64> a, vector<uint64> b){
       uint128_t product_acum = 0;
       for (int i = 0; i < size; i++){
        multiply_acum(a[i], b[i], product_acum);
      }}

This function is working fine but 128 bit arithmetic is quite slow. Any suggestions on improving the performance of if could avoid 128 bits arithmetics and use 64 bits instead as explained here Getting the high part of 64 bit integer multiplication

cigien
  • 57,834
  • 11
  • 73
  • 112
CryptoKitty
  • 654
  • 4
  • 20
  • 1
    Just let the optimizer have a go at your current `dot_product` function and I expect it will get very fast (because it doesn't return anything so it will be optimized to a null function). – The Photon May 03 '23 at 04:34
  • setting "-O3" flag improved the speed by at least 2x – CryptoKitty May 03 '23 at 05:23
  • 2
    Why don't you pass your dot product vectors by reference? Now you have unnecessary copying which will also slow things down. Note : uint128_t is not a standard type yet (so your code might not be portable between compilers). – Pepijn Kramer May 03 '23 at 07:40
  • 2
    Maybe you can also use : https://en.cppreference.com/w/cpp/algorithm/inner_product – Pepijn Kramer May 03 '23 at 07:45
  • Have you already tried [std::transform_reduce](https://en.cppreference.com/w/cpp/algorithm/transform_reduce)? It might take advantage of the [execution policy](https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag_t). – Bob__ May 03 '23 at 07:56
  • the technique in the q&a you link would yield `0` for eg `1*1` and many other input. If thats really what you want then you just need to shift the input as explained in the answers there. What stops you from implementing it to see how it performs? – 463035818_is_not_an_ai May 03 '23 at 08:07
  • "128 bit arithmetic is quite slow" On x64 it's [one instruction for the multiplication and two instructions for the addition](https://godbolt.org/z/1rvaseerc). It doesn't get any faster than that. I would guess it is basically the same on any non-broken hardware with 64-bit arithmetic. – n. m. could be an AI May 03 '23 at 08:53
  • @463035818_is_not_a_number did that actually performance is pretty much same – CryptoKitty May 03 '23 at 18:28
  • @Bob__ dont wanna use parallelization , needs a single core performance – CryptoKitty May 03 '23 at 18:30
  • whats the issue then? Did you actually profile your code and identified this part as a hot spot? Consider that often a scalar product is a hot spot, but not because it is inefficient, but merely because how often it is called in typical number crunching code. – 463035818_is_not_an_ai May 04 '23 at 07:24

1 Answers1

1

I've implemented very fast solution that uses intrinsics _mul128() and _addcarry_u64().

These two intrinsics allow 128-bit arithmetics to be implemented as fast as possible.

I did two version of code of MulAdd() function, one for MSVC, one for GCC/Clang.

Try it online!

#include <vector>
#include <cstdint>
#include <string>
#include <iostream>
#include <iomanip>
#include <sstream>
#include <random>

#include <immintrin.h>
#ifdef _MSC_VER
    #include <intrin.h>
#endif

using u64 = uint64_t;
#ifndef _MSC_VER
    using u128 = unsigned __int128;
#endif

void MulAdd(u64 a, u64 b, u64 & acc_hi, u64 & acc_lo) {
    #ifdef _MSC_VER
        u64 hi, lo = _mul128(a, b, (int64_t*)&hi);
        acc_hi += hi + _addcarry_u64(0, acc_lo, lo, &acc_lo);
    #else
        u128 res = u128(a) * b;
        acc_hi += u64(res >> 64) + _addcarry_u64(
            0, acc_lo, u64(res), (unsigned long long *)&acc_lo);
    #endif
}

void Add(std::vector<u64> const & a, std::vector<u64> const & b,
        u64 & acc_hi, u64 & acc_lo) {
    for (size_t i = 0; i < a.size(); ++i)
        MulAdd(a[i], b[i], acc_hi, acc_lo);
}

std::string ToStr(u64 lo, u64 hi = 0) {
    std::ostringstream ss;
    if (hi != 0)
        ss << std::setw(21) << hi;
    ss << std::setfill('0') << std::setw(21) << lo;
    return ss.str();
}

int main() {
    std::mt19937_64 rng{std::random_device{}()};
    std::vector<u64> a, b;
    for (size_t i = 0; i < 3; ++i) {
        a.push_back(rng());
        b.push_back(rng());
    }
    u64 acc_lo = 0, acc_hi = 0;
    Add(a, b, acc_hi, acc_lo);
    for (size_t i = 0; i < a.size(); ++i)
        std::cout << ToStr(a[i], 0) << " ";
    std::cout << std::endl << "*+" << std::endl;
    for (size_t i = 0; i < a.size(); ++i)
        std::cout << ToStr(b[i], 0) << " ";
    std::cout << std::endl << "=" << std::endl
        << " " << ToStr(acc_lo, acc_hi) << std::endl;
}

Output:

012590278628230096604 011309271885181787439 016272747539308973614 
*+
015736330376908751977 007340677540487466610 016269700642174965716 
=
  11146295818150996727001455841159592743234
Arty
  • 14,883
  • 6
  • 36
  • 69