I've been trying to improve the execution time of my PSO (Particle Swarm Optimization) implementation for image template matching.

Basically, the algorithm tries to locate a "sub-image" (the template) inside the original image it was taken from. It does this with an array of "particles" that update a "fitness" value on each iteration (it works somewhat like a genetic algorithm). The fitness is the result of this Normalized Cross-Correlation function:

// Normalized Cross-Correlation between the template and the region of
// the image whose top-left corner is at (x, y).
double ncc(const CImg<int> &image, const CImg<int> &templateImg, int x, int y){
    double numerator = 0, imgSum = 0, tempSum = 0;

    // Accumulate the dot product and the two squared norms over the template area.
    cimg_forXY(templateImg, i, j){
        numerator += image(x+i, y+j) * templateImg(i, j);

        imgSum += pow(image(x+i, y+j), 2);
        tempSum += pow(templateImg(i, j), 2);
    }

    return numerator / (sqrt(imgSum) * sqrt(tempSum));
}

I'm using the CImg library and its predefined macros for looping through the image.
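
As far as I can tell, cimg_forXY is just a pair of plain nested for loops with no locking inside, roughly equivalent to this:

// Roughly what cimg_forXY(templateImg, i, j){ ... } expands to:
// plain nested loops over the template's height and width.
for (int j = 0; j < templateImg.height(); ++j)
    for (int i = 0; i < templateImg.width(); ++i) {
        // body reads templateImg(i, j) and image(x+i, y+j)
    }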

I'm parallelizing the algorithm by giving each thread a subset of the particles array to process (I'm using std::async from <future>):

// Evaluates the particles in [from, to) and returns the best particle
// found by this thread.
auto updateSubset = [&](int nThread, int from, int to){
    double score;
    Particle localgbest = gbest, currentParticle;

    for(int i = from; i < to; i++){
        currentParticle = particles[i];

        // Fitness at the particle's current position, using this
        // thread's private copies of the image and the template.
        score = ncc(subsetImgs[nThread], subsetTemplates[nThread], 
            currentParticle.getPosition().getX(), 
            currentParticle.getPosition().getY()
        );

        if(score > currentParticle.getFitness()){
            currentParticle.setFitness(score);
            currentParticle.setPBest(currentParticle.getPosition());

            // Track the best particle seen by this thread.
            if(currentParticle.getFitness() > localgbest.getFitness()){
                localgbest = currentParticle;
            }
        }

        particles[i] = currentParticle;
    }

    return localgbest;
};
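
subsetImgs[] and subsetTemplates[] hold one private copy of the image and the template per thread (I explain why below). A minimal sketch of how they could be built (my real setup code isn't shown here, and the container type is an assumption):

// One private copy of the image and the template per thread, so the
// threads never read the same CImg object at the same time.
// (Sketch only: "image" and "templateImg" are the originals from above.)
vector<CImg<int>> subsetImgs, subsetTemplates;
for (int t = 0; t < nThreads; ++t) {
    subsetImgs.push_back(image);            // deep copy of the image
    subsetTemplates.push_back(templateImg); // deep copy of the template
}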

(gbest, particles[], subsetImgs[] and subsetTemplates[] are captured by reference in the lambda). Finally, the main loop of the program:

for(int i = 0; i < nIterations; i++){
    // Launch one asynchronous task per thread, each handling its
    // own range of particles.
    for(int j = 0; j < nThreads; j++){
        threads[j] = async(launch::async, updateSubset, j, bounds[j], bounds[j + 1]);
    }

    // Collect each thread's best particle and update the global best.
    for(int j = 0; j < nThreads; j++){            
        currentThreadGbest = threads[j].get();

        if(currentThreadGbest.getFitness() > gbest.getFitness())
            gbest = currentThreadGbest;
    }

    // Velocity and position update (sequential).
    for(int j = 0; j < nParticles; j++){
        particles[j].updatePosition(minPos, maxPos, gbest.getPosition());
    }
}

As you can see, I'm trying to avoid false sharing by touching the captured-by-reference variables as little as possible inside the thread function. That is also why I create the arrays subsetImgs[] and subsetTemplates[]: both are filled with copies of the same image so the threads never read the same image object at the same time. And from what I've researched, the CImg macros don't lock anything either. Still, I'm getting poor performance with 2 or more threads. My CPU is an AMD A6-4400M with 2 cores, and I'm measuring the time with clock() (the timing code is sketched after the results):

  1. Sequential (non-parallel) version: time = 3.09362
  2. Parallel version with one thread: time = 3.15978
  3. Parallel version with two threads: time = 4.41343
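
For reference, the measurement is roughly done like this (a sketch; runPSO() is just a placeholder for the main loop shown above):

// clock() reports processor time used by the process, converted to
// seconds with CLOCKS_PER_SEC. runPSO() is a hypothetical stand-in
// for the whole PSO loop above.
clock_t start = clock();

runPSO();

double time = double(clock() - start) / CLOCKS_PER_SEC;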

Any clue why this happens? Thanks in advance (and sorry for my English)!
