
I don't have much experience with OpenMP.

Is it possible to make the following code faster by using a for loop over a pointer instead of an index?

Is there any other way to make the following code faster?

The code multiplies an array by a constant.

Thank you.

Code:

#include <iostream>
#include <cstdlib>
#include <cstdint>
#include <vector>
using namespace std;

int main(void){
    size_t dim0 = 100;
    size_t dim1 = 200;
    std::vector<float> vec;
    vec.resize(dim0*dim1);
    float scalar = 0.9f;          // float literal avoids a double-to-float conversion
    size_t size_sq = dim0*dim1;
    // Scale every element of the array in parallel.
    #pragma omp parallel
    {
        #pragma omp for
        for(size_t i = 0; i < size_sq; ++i){
            vec[i] *= scalar;
        }
    }
}

Serial pointer loop:

float* ptr_start = vec.data();
float* ptr_end   = ptr_start + dim0*dim1;
for(float* ptr_now = ptr_start; ptr_now != ptr_end; ++ptr_now){
    *ptr_now *= scalar;
}
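
For reference, a pointer loop can itself be parallelized: OpenMP 3.0 and later accept a pointer as the loop variable of a canonical loop, but the loop condition must then be a relational comparison (`<`) rather than `!=`. A minimal sketch, assuming the same `vec`, `dim0`, `dim1`, and `scalar` as above:

// Sketch: parallel pointer loop (assumes OpenMP >= 3.0, which allows
// pointer induction variables in canonical loop form; the test must
// be < rather than != to satisfy that form).
float* ptr_start = vec.data();
float* ptr_end   = ptr_start + dim0*dim1;
#pragma omp parallel for
for(float* p = ptr_start; p < ptr_end; ++p){
    *p *= scalar;
}
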
  • There are only 20,000 values in your loop, and CPU synchronisation also has some overhead. Have you measured how fast the loop is with and without OMP? Can you share those results? – H. Guijt Jul 29 '16 at 17:45
  • The actual array is much bigger than this one. I also want to know whether I did something that hurts performance, because I will use OpenMP in other places too. – rxu Jul 29 '16 at 17:46
  • The actually generated code may differ from what you wrote. Did you disassemble the release build with all optimizations? P.S.: does your OpenMP allow you to use `size_t` as the index type? – ilotXXI Jul 29 '16 at 18:33
  • I am using the Intel C compiler. So far `size_t` has worked. What is the correct index type to use? – rxu Jul 29 '16 at 18:37
  • As you are using the Intel compiler, the opt-report options should be able to give you a quick assessment of the relative efficiency of `size_t`, pointer, and `int` indexing, of whether setting `omp for` without the `simd` clause inhibits vectorization, and the like. – tim18 Jul 30 '16 at 13:30
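
As tim18's comment above suggests, the `simd` clause asks the compiler to vectorize each thread's chunk of iterations. A minimal sketch of the combined construct, assuming OpenMP 4.0 or later:

// Sketch: threads plus explicit vectorization (assumes OpenMP >= 4.0,
// which introduced the combined parallel for simd construct).
#pragma omp parallel for simd
for(size_t i = 0; i < size_sq; ++i){
    vec[i] *= scalar;
}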

1 Answer


The parallel version of the loop over the raw data pointer should look like this:

size_t size_sq = vec.size();
float* ptr = vec.data();   // raw pointer into the vector's storage
#pragma omp parallel
{
    #pragma omp for
    for(size_t i = 0; i < size_sq; i++){
        ptr[i] *= scalar;
    }
}

`ptr` will be the same for all threads, so there is no problem there.

As an explanation, here are the data sharing attribute clauses (from Wikipedia):

shared: the data within a parallel region is shared, which means visible and accessible by all threads simultaneously. By default, all variables in the work sharing region are shared except the loop iteration counter.

private: the data within a parallel region is private to each thread, which means each thread will have a local copy and use it as a temporary variable. A private variable is not initialized and the value is not maintained for use outside the parallel region. By default, the loop iteration counters in the OpenMP loop constructs are private.

In this case, `i` is private and `ptr` is shared.
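
To see those attributes in code, the same loop can be written with the sharing spelled out explicitly. A sketch; `default(none)` is only there to force every variable's attribute to be declared:

// Sketch: the same loop with data-sharing attributes made explicit.
// default(none) forces each variable to be listed; i is the loop
// iteration variable, so it is private automatically.
#pragma omp parallel for default(none) shared(ptr, size_sq, scalar)
for(size_t i = 0; i < size_sq; i++){
    ptr[i] *= scalar;
}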

J.J. Hakala
  • Thanks. I didn't know the same address would refer to the same block of memory across all threads. – rxu Jul 29 '16 at 21:50
  • If this is parallelized successfully, default static scheduling will give each thread a nearly equal-sized chunk. – tim18 Jul 30 '16 at 13:32
  • Threads in the same process share the address space, except for the stack: http://stackoverflow.com/questions/1762418/process-vs-thread – rxu Jul 30 '16 at 14:08