I've been trying to improve the execution time of my PSO (Particle Swarm Optimization) implementation for image template matching.

Basically, the algorithm tries to locate a "sub-image" (the template) inside the original image it was taken from. It does this with an array of "particles" that update a "fitness" value on each iteration (it works somewhat like a genetic algorithm). The fitness is the result of this Normalized Cross-Correlation function:

// Normalized Cross-Correlation between the template and the region of
// the image whose top-left corner is at (x, y).
double ncc(const CImg<int> &image, const CImg<int> &templateImg, int x, int y){
    double numerator = 0, imgSum = 0, tempSum = 0;

    // Accumulate the dot product and the two squared norms over the template area.
    cimg_forXY(templateImg, i, j){
        numerator += image(x+i, y+j) * templateImg(i, j);

        imgSum += pow(image(x+i, y+j), 2);
        tempSum += pow(templateImg(i, j), 2);
    }

    return numerator / (sqrt(imgSum) * sqrt(tempSum));
}

I'm using the CImg library and its predefined macros for looping through the image.
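
As far as I can tell, cimg_forXY is just a pair of plain nested for loops with no locking inside, roughly equivalent to this:

// Roughly what cimg_forXY(templateImg, i, j){ ... } expands to:
// plain nested loops over the template's height and width.
for (int j = 0; j < templateImg.height(); ++j)
    for (int i = 0; i < templateImg.width(); ++i) {
        // body reads templateImg(i, j) and image(x+i, y+j)
    }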

I'm parallelizing the algorithm by giving each thread a subset of the particles array to process (I'm using std::async from <future>):

// Evaluates the particles in [from, to) and returns the best particle
// found by this thread.
auto updateSubset = [&](int nThread, int from, int to){
    double score;
    Particle localgbest = gbest, currentParticle;

    for(int i = from; i < to; i++){
        currentParticle = particles[i];

        // Fitness at the particle's current position, using this
        // thread's private copies of the image and the template.
        score = ncc(subsetImgs[nThread], subsetTemplates[nThread], 
            currentParticle.getPosition().getX(), 
            currentParticle.getPosition().getY()
        );

        if(score > currentParticle.getFitness()){
            currentParticle.setFitness(score);
            currentParticle.setPBest(currentParticle.getPosition());

            // Track the best particle seen by this thread.
            if(currentParticle.getFitness() > localgbest.getFitness()){
                localgbest = currentParticle;
            }
        }

        particles[i] = currentParticle;
    }

    return localgbest;
};
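
subsetImgs[] and subsetTemplates[] hold one private copy of the image and the template per thread (I explain why below). A minimal sketch of how they could be built (my real setup code isn't shown here, and the container type is an assumption):

// One private copy of the image and the template per thread, so the
// threads never read the same CImg object at the same time.
// (Sketch only: "image" and "templateImg" are the originals from above.)
vector<CImg<int>> subsetImgs, subsetTemplates;
for (int t = 0; t < nThreads; ++t) {
    subsetImgs.push_back(image);            // deep copy of the image
    subsetTemplates.push_back(templateImg); // deep copy of the template
}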

(gbest, particles[], subsetImgs[] and subsetTemplates[] are captured by reference in the lambda). Finally, the main loop of the program:

for(int i = 0; i < nIterations; i++){
    // Launch one asynchronous task per thread, each handling its
    // own range of particles.
    for(int j = 0; j < nThreads; j++){
        threads[j] = async(launch::async, updateSubset, j, bounds[j], bounds[j + 1]);
    }

    // Collect each thread's best particle and update the global best.
    for(int j = 0; j < nThreads; j++){            
        currentThreadGbest = threads[j].get();

        if(currentThreadGbest.getFitness() > gbest.getFitness())
            gbest = currentThreadGbest;
    }

    // Velocity and position update (sequential).
    for(int j = 0; j < nParticles; j++){
        particles[j].updatePosition(minPos, maxPos, gbest.getPosition());
    }
}

As you can see, I'm trying to avoid false sharing by touching the captured-by-reference variables as little as possible inside the thread function. That is also why I create the arrays subsetImgs[] and subsetTemplates[]: both are filled with copies of the same image so the threads never read the same image object at the same time. And from what I've researched, the CImg macros don't lock anything either. Still, I'm getting poor performance with 2 or more threads. My CPU is an AMD A6-4400M with 2 cores, and I'm measuring the time with clock() (the timing code is sketched after the results):

  1. Sequential (non-parallel) version: time = 3.09362
  2. Parallel version with one thread: time = 3.15978
  3. Parallel version with two threads: time = 4.41343
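
For reference, the measurement is roughly done like this (a sketch; runPSO() is just a placeholder for the main loop shown above):

// clock() reports processor time used by the process, converted to
// seconds with CLOCKS_PER_SEC. runPSO() is a hypothetical stand-in
// for the whole PSO loop above.
clock_t start = clock();

runPSO();

double time = double(clock() - start) / CLOCKS_PER_SEC;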

Any clue why this happens? Thanks in advance (and sorry for my English)!
