
I have a program with a function containing a for loop; at each iteration the loop solves a CPLEX problem. Every problem shares some constraints with the others and also has its own particular constraints. I used `#pragma omp parallel for` so that the problems are solved in parallel, and I set each CPLEX environment to use only one thread. To avoid false sharing I created a copy of the needed values on each thread, and I wrapped the reads and writes in `#pragma omp critical`. Even with all that, using 1 thread gives the best time, and the program gets slower as I increase the number of threads.

void solving(vector<struct block> &blocks,
             vector<vector<double>> A,
             vector<double> b,
             vector<vector<double>> C,
             vector<double> V,
             vector<vector<double>> ZZ,
             vector<double> Mm,
             double l,
             bool &change,
             vector<bool> &sign) {

    int tt = th; // th, nvar, p, m and M are globals defined elsewhere

    omp_set_num_threads(tt);

#pragma omp parallel for
    for (int ii = 0; ii < tt; ii++) {

        int TT;

        vector<struct block> blocks1;
        vector<vector<double>> A1;
        vector<double> b1;
        vector<vector<double>> C1;
        vector<double> V1;
        vector<vector<double>> ZZ1;
        vector<double> Mm1;
        double l1;
        vector<bool> sign1;
        double nvar1;
        int p1;
        int m1;

#pragma omp critical
        {
            TT = th;
            blocks1 = blocks;
            A1 = A;
            b1 = b;
            C1 = C;
            V1 = V;
            ZZ1 = ZZ;
            Mm1 = Mm;
            l1 = l;
            sign1 = sign;
            nvar1 = nvar;
            p1 = p;
            m1 = m;
        }

        IloEnv myenv1;            // environment object
        IloModel mymodel(myenv1); // model object

        IloNumVarArray X(myenv1, nvar1);
        for (int i = 0; i < nvar1; i++) {
            // X[i] = IloNumVar(myenv1, 0, IloInfinity, ILOINT);
            X[i] = IloNumVar(myenv1, 0, 1, ILOBOOL);
        }

        IloArray<IloNumVarArray> Y(myenv1, l1);
        for (int i = 0; i < l1; i++) {
            Y[i] = IloNumVarArray(myenv1, p1);
            for (int j = 0; j < p1; j++) {
                Y[i][j] = IloNumVar(myenv1, 0, 1, ILOBOOL);
            }
        }

        /* constraints different in each problem */

        IloExprArray exp1(myenv1, m1);

        for (int i = 0; i < m1; i++) {
            exp1[i] = IloExpr(myenv1);
        }

        for (int i = 0; i < m1; i++) {
            for (int k = 0; k < nvar1; k++) {
                exp1[i] += A1[i][k] * X[k];
            }

            mymodel.add(exp1[i] <= b1[i]);
        }

        // Y

        /* constraints different in each problem */

        IloExprArray exp3(myenv1, l1 * (p1 + 1.0));

        for (int i = 0; i < l1 * (p1 + 1.0); i++) {
            exp3[i] = IloExpr(myenv1);
        }

        for (int i = 0; i < l1; i++) {
            for (int j = 0; j < p1; j++) {
                int k = j + (i * p1);
                exp3[k] = (ZZ1[i][j] + 1) * Y[i][j] + (1 - Y[i][j]) * (-M);
            }
        }

        IloExprArray exp4(myenv1, p1 * l1);

        for (int i = 0; i < p1 * l1; i++) {
            exp4[i] = IloExpr(myenv1);
        }

        for (int k = 0; k < l1; k++) {

            for (int i = 0; i < p1; i++) {
                for (int j = 0; j < nvar1; j++) {
                    int s = i + (k * p1);
                    exp4[s] += C1[i][j] * X[j];
                }
            }
        }

        for (int i = 0; i < l1 * p1; i++) {
            mymodel.add(exp4[i] >= exp3[i]);
        }

        // Objective function
        IloExpr exp(myenv1);
        for (int i = 0; i < nvar1; i++) {
            exp += V1[i] * X[i];
        }

        mymodel.add(IloMaximize(myenv1, exp));

        // solve
        IloCplex mycplex(myenv1);
        mycplex.extract(mymodel);
        mycplex.setParam(IloCplex::Param::Threads, 1); // limit CPLEX itself to one thread
        // suppress solver log output
        mycplex.setOut(myenv1.getNullStream());

        IloBool feasible =
            mycplex.solve(); // solves model and stores
                             // whether or not it is feasible in an IloBool
                             // variable called "feasible"

        if (feasible == IloTrue) {

            for (int i = 0; i < p1; i++) {
                blocks1[ii].Z[i] = 0;
            }
            // value of x
            for (int i = 0; i < nvar1; i++) {
                blocks1[ii].X[i] = round(mycplex.getValue(X[i]));
            }

        } else {

            blocks1[ii].test1 = true;
            for (int i = 0; i < nvar1; i++) {
                blocks1[ii].X[i] = -M;
            }
            for (int i = 0; i < p1; i++) {

                blocks1[ii].Z[i] = -M;
            }
        }

        // Closing the Model
        mycplex.clear();
        myenv1.end();

#pragma omp critical
        {

            blocks[ii].test1 = blocks1[ii].test1;
            for (int i = 0; i < nvar; i++) {
                blocks[ii].X[i] = blocks1[ii].X[i];
            }

            for (int i = 0; i < p; i++) {
                blocks[ii].Z[i] = blocks1[ii].Z[i];
            }
        }
    }
}

I tried to make a copy of each variable in local memory and paste the values into it, and I wrapped this procedure in `#pragma omp critical` to avoid false sharing, so each thread pauses the other threads while it copies the variables. This made the computational time a little better, but it still gets worse as the number of threads increases.

  • Did not dig in any deeper, but in general: Threads come with quite some overhead — the stack needs to be allocated, they need to be correctly initialised, included into the thread scheduler, ... — all this doesn't come for free. If the data you operate on is not large enough, this overhead might outweigh the performance benefit from the actual processing. – Aconcagua Jul 11 '23 at 11:15
  • You're copying all data into local variables — not touching the original ones. Reading data concurrently doesn't produce a data race, so you shouldn't need the `omp critical` here. – Aconcagua Jul 11 '23 at 11:21
  • Looks like a bunch of vectors get passed in. By value. Then they get copied. By each execution thread. Copying a bunch of vectors ten times takes longer than copying them once. Memory bandwidth is limited. That's why your CPUs have caches. All these threads are trampling each other, racing to make their own copies of the same, precious, bytes. Efficient multithreading is hard to get right. OMP, and multiple execution threads, are not simple magic buttons that only need to be pushed to make everything go faster. – Sam Varshavchik Jul 11 '23 at 11:51
  • Creating threads takes time. Destroying threads takes time. Copying data takes time. Synchronization/locking takes time and may block threads for some time. Unless each thread does a significant amount of calculations that is independent of other threads, it's easy to end up in a situation where all the various overheads of using threads ends up being a net negative and just using one thread ends up being faster. Efficient multi threading is *difficult*. – Jesper Juhl Jul 11 '23 at 12:02
  • How many threads are you using? It should never be more threads then you have cores either. – Pepijn Kramer Jul 11 '23 at 12:31
  • @JesperJuhl OpenMP is very efficient in thread handling. Threads do not get created and destroyed: only paused and reactivated. I would not worry about this unless your parallel region is inside an outer loop that has thousands of iterations. – Victor Eijkhout Jul 11 '23 at 13:22
  • Apart from the unnecessary data copying as noted by others, please get rid of all `vector<vector<double>>`. That is very inefficient even sequentially and probably more so in parallel. – Victor Eijkhout Jul 11 '23 at 13:24
  • How do you measure the execution time? Please post the measured times for 1, 2, 4, 8 threads... As others mentioned, it's not useful to make private copies of data that are just read, and critical sections are not useful to read shared data. – PierU Jul 11 '23 at 15:35
  • I am using an Intel i5-11400H with 6 cores and 12 threads, and I am measuring the time with a chrono from the beginning to the end of the code. The times I got are: 1 thread -> 2155 ms, 2 threads -> 2709 ms, 6 threads -> 3252 ms, 8 threads -> 6494 ms. When I remove `#pragma omp critical`, it sometimes gives similar or worse results. – Bederina Med Jul 11 '23 at 18:41
  • "*I am measuring the time with a chrono from the beginning to the end of the code*" Which chrono? Steady monotonic wall clocks should measure the time properly. The standard way is to call `omp_get_wtime`. If you use functions like `clock`, then you will not measure the wall clock time but the CPU time (see [this post](https://stackoverflow.com/questions/2962785)). – Jérôme Richard Jul 11 '23 at 19:26
  • I do not know cplex much, but it seems it can use multiple threads to solve the problem (depending on its configuration). If so, using even more threads will not make things faster but actually slower (like splitting a task into many many small ones does not help to compute them faster with the same amount of people completing them). Please check the library is sequential using a profiler. – Jérôme Richard Jul 11 '23 at 19:33
  • I tried `omp_get_wtime` and I got the same results. My program calls this function `solving` multiple times, so it might be true that creating and destroying threads takes time. For CPLEX, I did limit the number of threads to 1 for each problem, so using one thread that solves the problems one after another was more efficient than using multiple threads where each one solves a different problem. – Bederina Med Jul 12 '23 at 09:49
  • Did you remove the private copies of the data you are just reading ? – PierU Jul 12 '23 at 10:03
  • No, I didn't remove them. – Bederina Med Jul 12 '23 at 12:25

0 Answers