
I have successfully multithreaded my code, but the result is very slow. There are no thread conflicts, and yet the multithreaded version runs much slower than the single-threaded one. The program uses Gaussian elimination to solve a system of equations, and I have used std::async to parallelize the matrix row operations.

Asynchronous version:

void ge(Mat &mat)
{
    using li = long int;

    for (li p = 0; p < mat[0].size() - 1; p++)
    {
        std::vector<std::future<void>> ts;

        for (li c = p + 1; c < mat.size(); c++)
        {
            auto x = mat[c][p] / mat[p][p];

            auto temp = vecMul(x, mat[p]);
            // vecSub(mat[c], temp);
            ts.push_back(std::async(vecSub, std::ref(mat[c]), temp));
        }
        for (auto &t : ts)
        {
            t.get();
        }
    }
}

Sequential version:

void ge(Mat &mat)
{
    using li = long int;

    for (li p = 0; p < mat[0].size() - 1; p++)
    {
        for (li c = p + 1; c < mat.size(); c++)
        {
            auto x = mat[c][p] / mat[p][p];

            auto temp = vecMul(x, mat[p]);
            vecSub(mat[c], temp);
        }
    }
}

(Benchmark plots omitted: "using async" vs. "sequential"; the measurements are given as text in edit 2 below.)

Full code:

#include <bits/stdc++.h>
using Vec = std::vector<double>;
using Mat = std::vector<Vec>;
using eqn = Mat;
class solver
{
    Mat mat;

public:
    // give eqn in the form a1*x1 + a2*x2 + ... + aN*xN = k (coefficients only)
    Vec solve(Mat &in)
    {

        mat = in;

        ge(mat);
        return (bs(mat));
    }
    Vec solve(Mat &&in)
    {
        mat = std::move(in);
        ge(mat);
        return (bs(mat));
    }

private:
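    // forward elimination: for each pivot row p, subtract a multiple of row p from every row below it so that column p becomes zero there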
    void ge(Mat &mat)
    {

        using li = long int;

        for (li p = 0; p < mat[0].size() - 1; p++)
        {
            std::vector<std::future<void>> ts;

            for (li c = p + 1; c < mat.size(); c++)
            {
                auto x = mat[c][p] / mat[p][p];

                auto temp = vecMul(x, mat[p]);
                // single thread: vecSub(mat[c], temp);
                ts.push_back(std::async(vecSub, std::ref(mat[c]), temp));
            }
            for (auto &t : ts)
            {
                t.get();
            }
        }
    }
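    // back substitution on the upper-triangular augmented matrix produced by ge()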
    Vec bs(Mat &mat)
    {
        using li = long int;
        Vec x(mat.size());
        for (li i = mat.size() - 1; i >= 0; i--)
        {
            double s = 0;
            for (li j = i; j < mat[0].size() - 1; j++)
            {
                s += mat[i][j] * x[j];
                x[i] = ((mat[i][mat[0].size() - 1] - s) / (mat[i][i]));
            }
        }
        return x;
    }
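    // returns a * b as a new vector; b is taken by value on purpose so the caller's row (the pivot row) is not modified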
    static Vec vecMul(double a, Vec b)
    {
        using li = size_t;
        for (li i = 0; i < b.size(); i++)
            b[i] *= a;
        return b;
    }
    static void vecAdd(Vec &a, Vec &b)
    {
        using li = size_t;
        assert(a.size() == b.size());
        for (li i = 0; i < a.size(); i++)
            a[i] = a[i] + b[i];
    }
    static void vecSub(Vec &a, Vec b)
    {
        using li = size_t;
        // assert(a.size() == b.size());
        for (li i = 0; i < a.size(); i++)
            a[i] = a[i] - b[i];
    }
};
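
For context, a minimal usage sketch (just an illustration, not my benchmark input; it assumes the solver class and the headers above are in scope):

// 2x +  y = 5
//  x + 3y = 10   ->  one augmented row {a1, a2, k} per equation
int main()
{
    Mat m = {{2, 1, 5},
             {1, 3, 10}};
    solver s;
    Vec x = s.solve(m); // expected: x[0] = 1, x[1] = 3
    for (double v : x)
        std::cout << v << '\n';
}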

Edit 1:
I tried using std::launch::async.
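The only change is the launch policy on the call inside ge(), roughly:

ts.push_back(std::async(std::launch::async, vecSub, std::ref(mat[c]), temp));
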
Edit 2: single thread:
time taken for size 3 is 3.3929e-05
time taken for size 53 is 0.00372395
time taken for size 103 is 0.0146523
time taken for size 153 is 0.0320243
time taken for size 203 is 0.0702842
time taken for size 253 is 0.129702
time taken for size 303 is 0.219656
time taken for size 353 is 0.33951
time taken for size 403 is 0.496915
time taken for size 453 is 0.697524

Multi thread:
time taken for size 3 is 0.000349948
time taken for size 53 is 0.0560127
time taken for size 103 is 0.160197
time taken for size 153 is 0.375889
time taken for size 203 is 0.663671
time taken for size 253 is 1.04643
time taken for size 303 is 1.52449
time taken for size 353 is 2.10555
time taken for size 403 is 2.7029
time taken for size 453 is 3.50366
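
The timing loop looks roughly like this (a reconstructed sketch, not my exact driver; randMat is a hypothetical helper that builds a random n x (n+1) augmented matrix and is not part of the code above):

// hypothetical benchmark driver, shown only to document the shape of the measurements
Mat randMat(size_t n)
{
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> d(1.0, 10.0);
    Mat m(n, Vec(n + 1));
    for (auto &row : m)
        for (auto &v : row)
            v = d(rng);
    return m;
}

int main()
{
    for (size_t n = 3; n <= 453; n += 50)
    {
        Mat m = randMat(n);
        solver s;
        auto t0 = std::chrono::steady_clock::now();
        s.solve(m);
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        std::cout << "time taken for size " << n << " is " << dt.count() << "\n";
    }
}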

Barath Keshav
  • Async approaches to CPU-bound tasks are usually the wrong way to tackle them and only make things worse. Async is for tasks that can't proceed, they're stuck waiting on IO. – tadman Mar 17 '21 at 06:00
  • [Why should I not #include <bits/stdc++.h>?](https://stackoverflow.com/questions/31816095/why-should-i-not-include-bits-stdc-h) – TruthSeeker Mar 17 '21 at 06:04
  • 1.) Did you compile with optimizations on? 2.) Any reason to take vectors by value? 3.) What's the size of the input? If it's small, the overhead of using async will overshadow the potential benefits; if the input is large, you pay a lot for taking vectors by value. – Lukas-T Mar 17 '21 at 06:39
  • Try it with `std::launch::async` – Le Ngoc Thuong Mar 17 '21 at 06:43
  • Please post the measurements as text, not as image. Thanks. – Werner Henze Mar 17 '21 at 06:58
  • I have given it in test form, and the benchmarks are using std::launch::async. – Barath Keshav Mar 17 '21 at 07:41
  • @WernerHenze I have added the measurements as text. – Barath Keshav Mar 17 '21 at 09:57
  • @churill 1) No, I'll give it a go. 2) Yes, because I don't want to modify the original matrix, and if I pass temp as a reference a segmentation fault will occur since it would have gone out of scope, as with my previous problem [here](https://stackoverflow.com/questions/66661380/stdasync-is-causing-segmentation-fault-or-slow/66662456?noredirect=1#comment117854692_66662456). – Barath Keshav Mar 17 '21 at 09:59
  • @TruthSeeker Yes, I'll change it. – Barath Keshav Mar 17 '21 at 10:00
  • @tadman But isn't async an implementation based on std::thread, which is itself an implementation of pthreads? And how do matrix libraries use all threads to increase the speed of execution? Are you saying that using async is not good, or that using multithreading in general for this task is bad? – Barath Keshav Mar 17 '21 at 10:02
  • You use threads to use more than one CPU core, something relevant on CPU-bound tasks. You can use async to help reduce the number of threads you need, but that's only possible if you're not heavily utilizing them. Think of async as a way of collapsing 10 threads only 10% utilized into 1 thread fully utilized. It does nothing in cases where you have threads already pinned at 100%. – tadman Mar 17 '21 at 17:27
  • To get more performance out of this kind of code: Use vector math (not `std::vector`, but SIMD instructions) either directly or via a pre-existing high-performance numerical library, or use GPU compute. A CUDA/OpenCL solution is often vastly faster than anything a CPU can do, even under ideal circumstances. – tadman Mar 17 '21 at 17:29
  • @tadman Thanks for the tips, I'll try them out (see the sketch after these comments). – Barath Keshav Mar 17 '21 at 18:05
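
Following the comments, here is a sketch of the "fewer, larger tasks" variant I understand them to be pointing at: one async task per chunk of rows instead of one per row, with the vecMul/vecSub pair fused so no temporary vector is copied into each task. This is only my sketch (geChunked and the chunk bounds are mine, not from the comments); it assumes the same headers and aliases as the full code above.

// sketch only: split the rows below the pivot into one chunk per hardware thread
void geChunked(Mat &mat)
{
    using li = long int;
    const li nThreads = std::max(1u, std::thread::hardware_concurrency());

    for (li p = 0; p + 1 < (li)mat[0].size(); p++)
    {
        std::vector<std::future<void>> ts;
        const li rows = (li)mat.size() - (p + 1);
        const li chunk = (rows + nThreads - 1) / nThreads;

        for (li begin = p + 1; begin < (li)mat.size(); begin += chunk)
        {
            const li end = std::min(begin + chunk, (li)mat.size());
            // each task owns a disjoint range of rows [begin, end) and only reads row p
            ts.push_back(std::async(std::launch::async, [&mat, p, begin, end] {
                for (li c = begin; c < end; c++)
                {
                    const double x = mat[c][p] / mat[p][p];
                    for (size_t j = 0; j < mat[c].size(); j++)
                        mat[c][j] -= x * mat[p][j]; // fused vecMul + vecSub, no temp copy
                }
            }));
        }
        for (auto &t : ts)
            t.get();
    }
}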

0 Answers