
I have implemented Summed Area Table (or Integral image) in C++ using OpenMP.

The problem is that the sequential code is always faster than the parallel code, no matter how I change the number of threads or the image size.

For example, I tried images from 100x100 up to 10000x10000 and thread counts from 1 to 64, but no combination is ever faster.

I also tried this code on different machines:

  • Mac OSX, 1.4 GHz Intel Core i5 (dual core)
  • Mac OSX, 2.3 GHz Intel Core i7 (quad core)
  • Ubuntu 16.04, Intel Xeon E5-2620 2.4 GHz (12 cores)

The time was measured with the OpenMP function omp_get_wtime().

For compiling I use: g++ -fopenmp -Wall main.cpp.

Here is the parallel code:

// Transposes src (N rows x M columns) into dst (M rows x N columns).
void transpose(unsigned long *src, unsigned long *dst, const int N, const int M) {
    #pragma omp parallel for
    for(int n = 0; n<N*M; n++) {
        int i = n/N;  // destination row
        int j = n%N;  // destination column
        dst[n] = src[M*j + i];
    }
}


unsigned long * integralImageMP(uint8_t*x, int n, int m){

    unsigned long * out = new unsigned long[n*m];
    unsigned long * rows = new unsigned long[n*m];

    // First pass: prefix sums along each of the n rows.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
    {
        rows[i*m] = x[i*m];
        for (int j = 1; j < m; ++j)
        {
            rows[i*m + j] = x[i*m + j] + rows[i*m + j - 1];
        }
    }

    transpose(rows, out, n, m);

    // Second pass: prefix sums along each row of the transposed table
    // (note: m rows of length n now), i.e. the column sums of the original.
    #pragma omp parallel for
    for (int i = 0; i < m; ++i)
    {
        rows[i*n] = out[i*n];
        for (int j = 1; j < n; ++j)
        {
            rows[i*n + j] = out[i*n + j] + rows[i*n + j - 1];
        }
    }

    transpose(rows, out, m, n);

    delete [] rows;
    return out;
}

Here is the sequential code:

// Sequential SAT with the standard recurrence:
//   out[i][j] = x[i][j] + out[i-1][j] + out[i][j-1] - out[i-1][j-1]
unsigned long * integralImage(uint8_t*x, int n, int m){
    unsigned long * out = new unsigned long[n*m];

    for (int i = 0; i < n; ++i)
    {
        for (int j = 0; j < m; ++j)
        {
            unsigned long val = x[i*m + j];
            if (i>=1)
            {
                val += out[(i-1)*m + j];
                if (j>=1)
                {
                    val += out[i*m + j - 1] - out[(i-1)*m + j - 1];
                }
            } else {
                if (j>=1)
                {
                    val += out[i*m + j -1];
                }
            }
            out[i*m + j] = val;
        }
    }

    return out;
}
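As a sanity check (not part of the question), the recurrence above can be verified against a brute-force double sum on a small random image. This is my own sketch; `sat` and `bruteForce` are hypothetical names, and `sat` simply restates the sequential routine's recurrence:

```cpp
#include <cstdint>
#include <vector>

// Summed-area table via the same recurrence as the sequential code:
//   out[i][j] = x[i][j] + out[i-1][j] + out[i][j-1] - out[i-1][j-1]
std::vector<unsigned long> sat(const uint8_t *x, int n, int m) {
    std::vector<unsigned long> out(n * m);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j) {
            unsigned long v = x[i * m + j];
            if (i > 0) v += out[(i - 1) * m + j];
            if (j > 0) v += out[i * m + j - 1];
            if (i > 0 && j > 0) v -= out[(i - 1) * m + (j - 1)];
            out[i * m + j] = v;
        }
    return out;
}

// Brute-force reference: sum of all pixels in the rectangle (0,0)..(i,j).
unsigned long bruteForce(const uint8_t *x, int m, int i, int j) {
    unsigned long s = 0;
    for (int a = 0; a <= i; ++a)
        for (int b = 0; b <= j; ++b)
            s += x[a * m + b];
    return s;
}
```

Asserting `sat(...)[i*m+j] == bruteForce(x, m, i, j)` for every (i, j) of a small random image is a quick way to guard against indexing mistakes when experimenting with the parallel variants.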

I also tried without the transpose, but it was even slower, probably because of the cache-unfriendly column-wise accesses.
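Since cache behaviour seems to dominate here, a tiled transpose is one common mitigation. This is a sketch of my own, not the question's code: `transposeBlocked` is a drop-in variant of the transpose above (same src N-rows-by-M-columns semantics), and the tile size `BLOCK` is an assumption to tune per machine:

```cpp
#include <algorithm>

// Hypothetical cache-blocked variant of the transpose in the question.
// src is N rows x M columns; dst becomes M rows x N columns.
void transposeBlocked(const unsigned long *src, unsigned long *dst,
                      const int N, const int M) {
    const int BLOCK = 32;  // assumed tile size; tune for the target cache
    #pragma omp parallel for collapse(2)
    for (int ib = 0; ib < N; ib += BLOCK) {
        for (int jb = 0; jb < M; jb += BLOCK) {
            const int iend = std::min(ib + BLOCK, N);
            const int jend = std::min(jb + BLOCK, M);
            // Both src reads and dst writes stay inside a small tile,
            // so at least one side of the copy is cache resident.
            for (int i = ib; i < iend; ++i)
                for (int j = jb; j < jend; ++j)
                    dst[j * N + i] = src[i * M + j];
        }
    }
}
```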

An example of calling code:

int main(int argc, char **argv){
    uint8_t* image = //read image from file (gray scale)
    int height = //height of the image
    int width = //width of the image

    double start_omp = omp_get_wtime();

    unsigned long* integral_image_parallel = integralImageMP(image, height, width); //parallel

    double end_omp = omp_get_wtime();

    double time_tot = end_omp - start_omp;

    std::cout << time_tot << std::endl;

    start_omp = omp_get_wtime();

    unsigned long* integral_image_serial = integralImage(image, height, width); //sequential

    end_omp = omp_get_wtime();

    time_tot = end_omp - start_omp;

    std::cout << time_tot << std::endl;

    return 0;
}

Each thread works on a block of rows. [Illustration of the per-thread blocks omitted.] ColumnSum is done by transposing the matrix and repeating RowSum.

Francesco Pegoraro

1 Answer


Let me first say that these results are a bit surprising to me; my guess is that the problem lies in the non-local memory accesses required by the transpose algorithm.

You can mitigate that anyway by turning the sequential algorithm into a parallel one with a two-pass approach. In the first pass, T threads each compute the 2D integral of their own block of N rows; in the second pass, each block is compensated for the fact that it started from zero instead of from the accumulated result of the previous block's last row.

An example with Matlab shows the principle in 2D.

 f=fix(rand(12,8)*8)   % A random matrix with 12 rows, 8 columns

 5     6     1     4     7     5     4     4
 4     6     0     7     1     3     2     0
 7     0     2     3     0     1     6     3
 5     3     1     7     4     3     7     2
 6     4     3     2     7     3     5     1
 3     3     2     5     5     0     2     1
 3     5     7     5     1     4     4     3
 6     5     7     4     2     1     0     0
 0     2     0     5     3     3     7     4
 1     3     5     5     7     4     7     3
 1     0     2     1     1     2     6     5
 3     7     3     1     6     2     2     5


ff=cumsum(cumsum(f')')   % The Summed Area Table
 5    11    12    16    23    28    32    36
 9    21    22    33    41    49    55    59
16    28    31    45    53    62    74    81
21    36    40    61    73    85   104   113
27    46    53    76    95   110   134   144
30    52    61    89   113   128   154   165
33    60    76   109   134   153   183   197
39    71    94   131   158   178   208   222
39    73    96   138   168   191   228   246
40    77   105   152   189   216   260   281
41    78   108   156   194   223   273   299
44    88   121   170   214   245   297   328

fx=[cumsum(cumsum(f(1:4,:)')');   %  The original table summed in 
    cumsum(cumsum(f(5:8,:)')');   %  three parts -- 4 rows per each
    cumsum(cumsum(f(9:12,:)')')]  %  "thread"

 5    11    12    16    23    28    32    36
 9    21    22    33    41    49    55    59
16    28    31    45    53    62    74    81
21    36    40    61    73    85   104   113   %% Notice this row #4
 6    10    13    15    22    25    30    31
 9    16    21    28    40    43    50    52
12    24    36    48    61    68    79    84
18    35    54    70    85    93   104   109   %% Notice this row #8
 0     2     2     7    10    13    20    24
 1     6    11    21    31    38    52    59
 2     7    14    25    36    45    65    77
 5    17    27    39    56    67    89   106

fx(4,:) + fx(8,:)  %% this is the SUM of row #4 and row #8
39    71    94   131   158   178   208   222

 %% and finally -- what is the difference of the piecewise
 %% calculated result and the real result?
 ff-fx

 0     0     0     0     0     0     0     0    %% look !! the first block 
 0     0     0     0     0     0     0     0    %% is already correct
 0     0     0     0     0     0     0     0
 0     0     0     0     0     0     0     0
21    36    40    61    73    85   104   113    %% All these rows in this
21    36    40    61    73    85   104   113    %% block are short by
21    36    40    61    73    85   104   113    %% the row #4 above
21    36    40    61    73    85   104   113    %%
39    71    94   131   158   178   208   222  %%   and all these rows
39    71    94   131   158   178   208   222  %%   in this block are short
39    71    94   131   158   178   208   222  %%   by the SUM of the rows
39    71    94   131   158   178   208   222  %%   #4 and #8 above

Fortunately, one can start integrating block #2 (i.e. rows 2N..3N-1) before block #1 has been compensated -- one just has to calculate the offset, which is a relatively small sequential task.

acc_for_block_2 = row[2*N-1] + row[N-1];
acc_for_block_3 = acc_for_block_2 + row[3*N-1];
..
acc_for_block_T-1 = acc_for_block_(T-2) + row[N*(T-1)-1];
Aki Suihkonen
  • I don't think the transpose operation is an issue. I profiled the OP's code and it occupies only a small amount of time. Moreover, removing the transpose and doing the prefix sums directly along the columns (not a cache-friendly operation) is much slower than transposing and then cycling along the rows as the OP does (I'm not guessing, I tried it). This all holds for reasonably big images; I don't know what happens for really small ones. – Luca Angioloni May 15 '18 at 18:01
  • Anyway your method may still be more efficient, but I don't understand it fully. Could you elaborate a little? – Luca Angioloni May 15 '18 at 18:01