
I was trying to parallelize the Gaussian blur function below using OpenMP, but I am new to OpenMP. When I parallelized the two for loops (I don't think there are any variables that need to be private for each thread), it ended up running even slower than before, and the output was different. Did I do anything wrong? What should I do to make it run faster?

void gaussian_blur(float **src, float **dst, int w, int h, float sigma)
{
    int x, y, i;
    int ksize = (int)(sigma * 2.f * 4.f + 1) | 1;
    int halfk = ksize / 2;
    float scale = -0.5f/(sigma*sigma);
    float sum = 0.f;
    float *kernel, *ringbuf;
    int xmax = w - halfk;
    int ymax = h - halfk;

    // if sigma too small, just copy src to dst
    if (ksize <= 1)
    {
        for (y = 0; y < h; y++)
            for (x = 0; x < w; x++)
                dst[y][x] = src[y][x];
        return;
    }

    // create Gaussian kernel
    kernel = malloc(ksize * sizeof(float));
    ringbuf = malloc(ksize * sizeof(float));

    #pragma omp parallel for reduction(+ : sum)
    for (i = 0; i < ksize; i++)
    {
        float x = (float)(i - halfk);
        float t = expf(scale * x * x);
        kernel[i] = t;
        sum += t;
    }

    scale = 1.f / sum;
    #pragma omp parallel for
    for (i = 0; i < ksize; i++)
        kernel[i] *= scale;

    // blur each row
    #pragma omp parallel for // this is the for loop I parallelized but ended up with wrong output and running slower
    for (y = 0; y < h; y++)
    {
        int x1;
        int bufi0 = ksize-1;
        float tmp = src[y][0];

        for (x1 = 0; x1 < halfk  ; x1++) ringbuf[x1] = tmp;
        for (; x1 < ksize-1; x1++) ringbuf[x1] = src[y][x1-halfk];

        for (x1 = 0; x1 < w; x1++)
        {
            if(x1 < xmax)
                ringbuf[bufi0++] = src[y][x1+halfk];
            else
                ringbuf[bufi0++] = src[y][w-1];
            if (bufi0 == ksize) bufi0 = 0;
            dst[y][x1] = convolve(kernel, ringbuf, ksize, bufi0);
        }
    }

    // blur each column
    #pragma omp parallel for // this is the for loop I parallelized but ended up with wrong output and running slower
    for (x = 0; x < w; x++)
    {
        int y1;
        int bufi0 = ksize-1;
        float tmp = dst[0][x];
        for (y1 = 0; y1 < halfk  ; y1++) ringbuf[y1] = tmp;
        for (     ; y1 < ksize-1; y1++) ringbuf[y1] = dst[y1-halfk][x];

        for (y1 = 0; y1 < h; y1++)
        {
            if(y1 < ymax)
                ringbuf[bufi0++] = dst[y1+halfk][x];
            else
                ringbuf[bufi0++] = dst[h-1][x];

            if (bufi0 == ksize) bufi0 = 0;
            dst[y1][x] = convolve(kernel, ringbuf, ksize, bufi0);
        }
    }

    // clean up
    free(kernel);
    free(ringbuf);
}
  • It looks to me like you have a race condition in ringbuf. Each thread is competing to write to the same array elements. You may have other race conditions as well. Also, avoid using increment operators on array indices with OpenMP (e.g. ringbuf[bufi0++]); you need to access the array explicitly as a function of the loop variables (x, y, ...). –  May 18 '13 at 10:38
  • Welcome to SO @user2395837, you got two answers from two physicists: Leonard Hofstadter and Sheldon Cooper. I'll let you figure out which one of us is Sheldon. –  May 21 '13 at 13:50

2 Answers


I may have fixed your code. You did not post your convolve function, so it's difficult to say for sure, but I'm not sure it matters. There are at least two bugs. The first is a race condition on the ringbuf array. To fix this I extended the array by a factor of the number of threads:

ringbuf = (float*)malloc(nthreads*ksize * sizeof(float));

To access the array, do something like this:

int ithread = omp_get_thread_num();
ringbuf[ksize*ithread + x1]

Edit: I added some code which defines ringbuf inside the parallel block. That way you don't have to access ringbuf based on the thread number.
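
In outline, that version looks like this (a sketch of the structure only; the complete listing is at the end of this answer):

#pragma omp parallel
{
    // Each thread allocates its own ring buffer, so nothing is shared between
    // threads and no per-thread offset arithmetic is needed.
    float *ringbuf = (float*)_mm_malloc(ksize * sizeof(float), 16);
    #pragma omp for
    for (y = 0; y < h; y++)
    {
        /* fill ringbuf and convolve exactly as in the original row loop */
    }
    _mm_free(ringbuf);
}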

The second error is the bufi0 index variable. I defined a new one like this:

const int ibufi0_fix = (x1+ksize-1)%ksize;
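
For reference, since the original convolve was not posted, here is only my guess at how a ring-buffer-aware convolve might use that start index (an assumption, not necessarily the OP's actual function): it applies the kernel starting at offset bufi0 and wraps around the buffer.

float convolve(const float *kernel, const float *ringbuf, int ksize, int bufi0)
{
    // Hypothetical ring-buffer-aware convolve: bufi0 is assumed to point at
    // the oldest sample, and the window wraps around the end of the buffer.
    float sum = 0.f;
    for (int i = 0; i < ksize; i++)
        sum += kernel[i] * ringbuf[(bufi0 + i) % ksize];
    return sum;
}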

Below is the code I used to check it. Replace convolve with your own function. Note that this may still be quite inefficient: there are probably cache issues such as cache misses and false sharing (particularly when you convolve vertically). Hopefully, though, the image will be correct now.

Edit: here is a paper by Intel that shows how to do this best with AVX. It's optimized to minimize cache misses. I'm not sure it's optimized for threading, though. http://software.intel.com/en-us/articles/iir-gaussian-blur-filter-implementation-using-intel-advanced-vector-extensions

I'm writing my own function for this (it's actually the reason I started learning OpenMP) which uses SSE/AVX as well. There are a lot of similarities between matrix multiplication and image filtering, so I learned how to optimize matrix multiplication first and will do the Gaussian blur shortly...

#include "math.h"
#include "omp.h"
#include "stdio.h"
#include <nmmintrin.h>

float convolve(const float *kernel, const float *ringbuf, const int ksize, const int bufi0) {
    float sum = 0.0f;
    for(int i=0; i<ksize; i++) {
        sum += kernel[i]*ringbuf[i];
    }
    return sum;
}


void gaussian_blur(float *src, float *dst, int w, int h, float sigma, int nthreads)
{
    int x, y, i;
    int ksize = (int)(sigma * 2.f * 4.f + 1) | 1;
    int halfk = ksize / 2;
    printf("ksize %d\n", ksize);
    float scale = -0.5f/(sigma*sigma);
    float sum = 0.f;
    float *kernel, *ringbuf;
    int xmax = w - halfk;
    int ymax = h - halfk;

    // if sigma too small, just copy src to dst
    if (ksize <= 1)
    {
        for (y = 0; y < h; y++)
            for (x = 0; x < w; x++)
                dst[y*w + x] = src[y*w + x];
        return;
    }

    // create Gaussian kernel
    //kernel = malloc(ksize * sizeof(float));
    kernel =  (float*)_mm_malloc(ksize * sizeof(float),16);
    //ringbuf = malloc(ksize * sizeof(float));
    ringbuf = (float*)_mm_malloc(nthreads*ksize * sizeof(float),16);

    #pragma omp parallel for reduction(+ : sum) if(nthreads>1)
    for (i = 0; i < ksize; i++)
    {
        float x = (float)(i - halfk);
        float t = expf(scale * x * x);
        kernel[i] = t;
        sum += t;
    }

    scale = 1.f / sum;
    #pragma omp parallel for if(nthreads>1)
    for (i = 0; i < ksize; i++)
        kernel[i] *= scale;

    // blur each row
    #pragma omp parallel for if(nthreads>1)// this is the for loop I parallelized but ended up with wrong output and running slower
    for (y = 0; y < h; y++)
    {
        int ithread = omp_get_thread_num();
        //printf("nthread %d\n", nthread);
        int x1;
        int bufi0 = ksize-1;
        float tmp = src[y*w + 0];
        for (x1 = 0; x1 < halfk  ; x1++) ringbuf[ksize*ithread + x1] = tmp;
        for (; x1 < ksize-1; x1++) ringbuf[ksize*ithread + x1] = src[y*w + x1-halfk];
        for (x1 = 0; x1 < w; x1++)
        {
            const int ibufi0_fix = (x1+ksize-1)%ksize;

            if(x1 < xmax)
                ringbuf[ksize*ithread + ibufi0_fix] = src[y*w + x1+halfk];
            else
                ringbuf[ksize*ithread + ibufi0_fix] = src[y*w + w-1];
            if (bufi0 == ksize) bufi0 = 0;
            dst[y*w + x1] = convolve(kernel, &ringbuf[ksize*ithread], ksize, bufi0);
        }
    }
    // blur each column
    #pragma omp parallel for if(nthreads>1)// this is the for loop I parallelized but ended up with wrong output and running slower
    for (x = 0; x < w; x++)
    {
        int ithread = omp_get_thread_num();
        int y1;
        int bufi0 = ksize-1;
        float tmp = dst[0*w + x];
        for (y1 = 0; y1 < halfk  ; y1++) ringbuf[ksize*ithread + y1] = tmp;
        for (     ; y1 < ksize-1; y1++) ringbuf[ksize*ithread + y1] = dst[(y1-halfk)*w + x];

        for (y1 = 0; y1 < h; y1++)
        {
            const int ibufi0_fix = (y1+ksize-1)%ksize;
            if(y1 < ymax)
                ringbuf[ksize*ithread + ibufi0_fix] = dst[(y1+halfk)*w + x];
            else
                ringbuf[ksize*ithread + ibufi0_fix] = dst[(h-1)*w + x];

            if (bufi0 == ksize) bufi0 = 0;
            dst[y1*w + x] = convolve(kernel, &ringbuf[ksize*ithread], ksize, bufi0);
        }
    }

    // clean up
    _mm_free(kernel);
    _mm_free(ringbuf);
}

int compare(float *dst1, float *dst2, const int n) {
    int error = 0;
    for(int i=0; i<n; i++) {
        if(dst1[i] != dst2[i]) error++;
    }
    return error;
}


int main() {
    const int w = 20;
    const int h = 20;

    float *src =  (float*)_mm_malloc(w*h*sizeof(float),16);
    float *dst1 =  (float*)_mm_malloc(w*h*sizeof(float),16);
    float *dst2 =  (float*)_mm_malloc(w*h*sizeof(float),16);
    for(int i=0; i<w*h; i++) {
        src[i] = i;
    }

    gaussian_blur(src, dst1, w, h, 1.0f, 1);
    gaussian_blur(src, dst2, w, h, 1.0f, 4);
    int error = compare(dst1, dst2, w*h);
    printf("error %d\n", error);
    _mm_free(src);
    _mm_free(dst1);
    _mm_free(dst2);
}

Edit: here is code which defines ringbuf inside the parallel block based on the comment by Hristo. It should be equivalent.

#include "math.h"
#include "omp.h"
#include "stdio.h"
#include <nmmintrin.h>

float convolve(const float *kernel, const float *ringbuf, const int ksize, const int bufi0) {
    float sum = 0.0f;
    for(int i=0; i<ksize; i++) {
        sum += kernel[i]*ringbuf[i];
    }
    return sum;
}


void gaussian_blur(float *src, float *dst, int w, int h, float sigma, int nthreads)
{
    int x, y, i;
    int ksize = (int)(sigma * 2.f * 4.f + 1) | 1;
    int halfk = ksize / 2;
    printf("ksize %d\n", ksize);
    float scale = -0.5f/(sigma*sigma);
    float sum = 0.f;
    float *kernel;
    int xmax = w - halfk;
    int ymax = h - halfk;

    // if sigma too small, just copy src to dst
    if (ksize <= 1)
    {
        for (y = 0; y < h; y++)
            for (x = 0; x < w; x++)
                dst[y*w + x] = src[y*w + x];
        return;
    }

    // create Gaussian kernel
    //kernel = malloc(ksize * sizeof(float));
    kernel =  (float*)_mm_malloc(ksize * sizeof(float),16);

    #pragma omp parallel for reduction(+ : sum) if(nthreads>1)
    for (i = 0; i < ksize; i++)
    {
        float x = (float)(i - halfk);
        float t = expf(scale * x * x);
        kernel[i] = t;
        sum += t;
    }

    scale = 1.f / sum;
    #pragma omp parallel for if(nthreads>1)
    for (i = 0; i < ksize; i++)
        kernel[i] *= scale;

    // blur each row
    //#pragma omp parallel for if(nthreads>1)// this is the for loop I parallelized but ended up with wrong output and running slower
    #pragma omp parallel if(nthreads>1) 
    {
        float *ringbuf = (float*)_mm_malloc(ksize * sizeof(float),16);
        #pragma omp for// this is the for loop I parallelized but ended up with wrong output and running slower
        for (y = 0; y < h; y++)
        {
            //printf("nthread %d\n", nthread);
            int x1;
            int bufi0 = ksize-1;
            float tmp = src[y*w + 0];
            for (x1 = 0; x1 < halfk  ; x1++) ringbuf[x1] = tmp;
            for (; x1 < ksize-1; x1++) ringbuf[x1] = src[y*w + x1-halfk];
            for (x1 = 0; x1 < w; x1++)
            {
                const int ibufi0_fix = (x1+ksize-1)%ksize;

                if(x1 < xmax)
                    ringbuf[ibufi0_fix] = src[y*w + x1+halfk];
                else
                    ringbuf[ibufi0_fix] = src[y*w + w-1];
                if (bufi0 == ksize) bufi0 = 0;
                dst[y*w + x1] = convolve(kernel, ringbuf, ksize, bufi0);
            }
        }
        _mm_free(ringbuf);
    }

    // blur each column
    #pragma omp parallel if(nthreads>1)
    {
        float *ringbuf = (float*)_mm_malloc(ksize * sizeof(float),16);
        #pragma omp for// this is the for loop I parallelized but ended up with wrong output and running slower
        for (x = 0; x < w; x++)
        {
            int y1;
            int bufi0 = ksize-1;
            float tmp = dst[0*w + x];
            for (y1 = 0; y1 < halfk  ; y1++) ringbuf[y1] = tmp;
            for (     ; y1 < ksize-1; y1++) ringbuf[y1] = dst[(y1-halfk)*w + x];

            for (y1 = 0; y1 < h; y1++)
            {
                const int ibufi0_fix = (y1+ksize-1)%ksize;
                if(y1 < ymax)
                    ringbuf[ibufi0_fix] = dst[(y1+halfk)*w + x];
                else
                    ringbuf[ibufi0_fix] = dst[(h-1)*w + x];

                if (bufi0 == ksize) bufi0 = 0;
                dst[y1*w + x] = convolve(kernel, ringbuf, ksize, bufi0);
            }
        }
        _mm_free(ringbuf);
    }
    // clean up
    _mm_free(kernel);
}

int compare(float *dst1, float *dst2, const int n) {
    int error = 0;
    for(int i=0; i<n; i++) {
        if(dst1[i] != dst2[i]) error++;
    }
    return error;
}


int main() {
    const int w = 20;
    const int h = 20;

    float *src =  (float*)_mm_malloc(w*h*sizeof(float),16);
    float *dst1 =  (float*)_mm_malloc(w*h*sizeof(float),16);
    float *dst2 =  (float*)_mm_malloc(w*h*sizeof(float),16);
    for(int i=0; i<w*h; i++) {
        src[i] = i;
    }

    gaussian_blur(src, dst1, w, h, 1.0f, 1);
    gaussian_blur(src, dst2, w, h, 1.0f, 4);
    int error = compare(dst1, dst2, w*h);
    printf("error %d\n", error);
    _mm_free(src);
    _mm_free(dst1);
    _mm_free(dst2);
}
  • Why are you reinventing the wheel when you can simply declare `ringbuf` as `private`? – Hristo Iliev May 19 '13 at 15:02
  • @HristoIliev, thanks, that's a good point. I added code to reflect what I think you mean. Since it is an array, can I really define it outside the parallel block and then simply declare it as private? Don't I need to either explicitly make nthreads copies of the array outside the parallel block or define it inside the parallel block? –  May 19 '13 at 16:09
  • Thank you so much for your help. I understand the race condition you mentioned, but shouldn't the call to the convolve() function have bufi0 changed to ibufi0_fix instead, since bufi0 is unchanged in your code? Thanks. – user2395837 May 19 '13 at 19:31
  • Yes, that's correct. I used a generic convolve function which did not do anything with bufi0. You should pass ibufi0_fix. –  May 19 '13 at 21:43
  • Having `ringbuf` declared inside the scope of the parallel block makes it implicitly private. If you declare it outside, then it must be explicitly declared as `private`. If it is a pointer rather than array, then memory also has to be explicitly allocated inside the parallel region. – Hristo Iliev May 20 '13 at 00:42
  • @HristoIliev: Interesting, I was not aware that OpenMP would automatically handle private arrays declared on the stack outside of the parallel region. http://stackoverflow.com/questions/2352895/how-to-ensure-a-dynamically-allocated-array-is-private-in-openmp In any case, these are dynamic arrays, so I did the right thing by allocating the memory myself inside the parallel region. –  May 20 '13 at 07:38
  • Also, these `if(nthreads>1)` clauses are redundant. They only activate the parallel regions at runtime if the number of threads is more than one, which is already how OpenMP treats activation of parallel regions (according to the standard, an active parallel region is one that is being executed by a team of more than one thread). – Hristo Iliev May 20 '13 at 10:39
  • @HristoIliev, I added if(nthreads>1) in order to try to remove the overhead, but it does not seem to make a difference: http://stackoverflow.com/questions/16635814/reasons-for-omp-set-num-threads1-slower-than-no-openmp/16646282#16646282 –  May 20 '13 at 12:31
  • @user2395837, I added a link to an Intel paper to my answer that shows how to do the Gaussian blur in a cache-friendly way. –  May 20 '13 at 12:47

Besides the need to properly identify private and shared data, there are several things that you could do in order to speed up your program.

As a first step you should remove any unnecessary concurrency. For example, how big does ksize happen to be on average? If it is less than several hundred elements, it makes absolutely no sense to employ OpenMP for such simple operations as computing the kernel and then normalising it:

#pragma omp parallel for reduction(+ : sum)
for (i = 0; i < ksize; i++)
{
    float x = (float)(i - halfk);
    float t = expf(scale * x * x);
    kernel[i] = t;
    sum += t;
}

scale = 1.f / sum;
#pragma omp parallel for
for (i = 0; i < ksize; i++)
    kernel[i] *= scale;

On a typical modern CPU it would take more cycles to bootstrap the parallel regions than to compute this on a single core. Also, on modern CPUs these loops can be unrolled and vectorised, and you can get up to an 8x boost on a single core. If the kernel is too small, then besides the OpenMP overhead you will also get a slowdown from excessive false sharing. You have to make sure that each thread gets an exact multiple of 16 elements (64 bytes of cache line size divided by sizeof(float)) to work on in order to prevent false sharing.

You also have to make sure that threads do not share cache lines in the column blur section.

// blur each column
#pragma omp parallel for
for (x = 0; x < w; x++)
{
    ...
    for (y1 = 0; y1 < h; y1++)
    {
        ...
        dst[y1][x] = convolve(kernel, ringbuf, ksize, bufi0);
    }
}

Because of the access pattern here, you have to make sure that each thread gets a chunk of columns that is a multiple of 16, or else there will be a border overlap area of 16*h pixels shared by every two consecutive threads where excessive false sharing will occur. If you cannot guarantee that w is divisible by 16, then you can give each thread a starting offset in the y direction, e.g. the innermost loop becomes:

int tid = omp_get_thread_num();

for (y1 = 2*tid; y1 < h; y1++)
{
   ...
}
for (y1 = 0; y1 < 2*tid; y1++)
{
   ...
}

The multiplier 2 is arbitrary. The idea is to give the next thread a head start of several rows over the current one, so that no two threads are ever processing the same line at the same moment. You could also use addition and modulo arithmetic to compute y1, i.e.

for (y2 = 0; y2 < h; y2++)
{
    y1 = (y2 + 2*tid) % h;
    ...
}

but this is generally slower than just separating the loop in two parts.
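
Conversely, if you can guarantee that w is a multiple of 16 and that each image row starts on a 64-byte boundary (an assumption, since the allocation code is not shown), the offset trick is unnecessary: handing out the columns in aligned blocks of 16 already keeps every cache line of dst private to a single thread. A sketch (ringbuf must of course still be made private per thread, as discussed above):

// blur each column in chunks of 16 columns: each 64-byte cache line of every
// dst row is then written by exactly one thread (assumes w % 16 == 0 and
// 64-byte aligned rows).
#pragma omp parallel for schedule(static, 16)
for (x = 0; x < w; x++)
{
    ...
    for (y1 = 0; y1 < h; y1++)
    {
        ...
        dst[y1][x] = convolve(kernel, ringbuf, ksize, bufi0);
    }
}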

Also mind your data size. The last-level cache (LLC) has very high but still limited bandwidth. If the data cannot fit in the private cache of each core, then compiler optimisations such as loop vectorisation can put very high pressure on the LLC. Things get uglier if the data doesn't fit in the LLC and therefore the main memory has to be accessed.

If you don't know what false sharing is, there is an article in Dr. Dobb's that explains it.
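
To get a feel for the effect, here is a tiny standalone toy program (a made-up example, not taken from the article): there is no data race, yet the threads slow each other down because their counters sit in the same cache line.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    // Eight per-thread counters packed into 32 contiguous bytes: they all end
    // up in the same cache line (or at most two), which the cores then keep
    // stealing from each other. volatile stops the compiler from holding a
    // counter in a register, so every iteration really stores to that line.
    volatile float counters[8] = {0};

    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(8)
    {
        int tid = omp_get_thread_num();      // each thread has its own element
        for (int i = 0; i < 10000000; i++)
            counters[tid] += 1.0f;           // no race, but heavy line bouncing
    }
    printf("%.3f s\n", omp_get_wtime() - t0);
    return 0;
}

Padding each counter out to its own 64-byte line (or accumulating into a local variable and writing it back once at the end) removes the slowdown.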

Hristo Iliev
  • @HristoIliev, very good comments, particularly the one about not using OpenMP to generate the kernel. But how about giving me an upvote (and the user too, since I think it's a good question)? I think I did identify the race condition, which was what was giving the wrong result and was likely one of the biggest slowdowns. –  May 20 '13 at 12:52