
I am trying to understand and test OpenMP GPU offloading. Some examples on the internet (1, 2, 3) are similar to mine, yet my example does not behave as I expect. I am using g++ 9.4 on Ubuntu 20.04 LTS with gcc-9-offload-nvptx installed.

My example, which does not work, is similar to this one:

#include <cstdio>   // printf
#include <cstdlib>  // atoi, exit
#include <vector>

int main(int argc, char *argv[]) {

  typedef double myfloat;

  // Expect exactly one argument: the problem size.
  if (argc != 2) exit(1);

  size_t size = atoi(argv[1]);
  printf("Size: %zu\n", size);

  std::vector<myfloat> data_1(size, 2);
  myfloat *data1_ptr = data_1.data();
  myfloat sum = -1;

#pragma omp target map(tofrom:sum) map(from: data1_ptr[0:size])
#pragma omp teams distribute parallel for simd reduction(+:sum) collapse(2)
  for (size_t i = 0; i < size; ++i) {
    for (size_t j = 0; j < size; ++j) {

      myfloat term1 = data1_ptr[i] * i;
      sum += term1 / (1 + term1 * term1 * term1);
    }
  }

  printf("sum: %.2f\n", sum);

  return 0;
}

When I compile it with g++ main.cpp -o test -fopenmp -fcf-protection=none -fno-stack-protector, I get the following:

stack_example.cpp: In function ‘main._omp_fn.0.hsa.0’:                                                                                                                                                          
cc1plus: warning: could not emit HSAIL for the function [-Whsa]                                                                                                                                                 
cc1plus: note: support for HSA does not implement non-gridified OpenMP parallel constructs.

It does compile, but when I run it with

./test 10000

the printed sum is still -1. It looks as if the sum computed on the GPU is not copied back, but I map it explicitly, so shouldn't it be returned? What am I doing wrong?

EDIT 1

I was asked to modify my code because it contained a redundant for loop left over from earlier versions and because sum was initialized to -1. I fixed both and also compiled with gcc-11, which does not emit the warning or note that gcc-9 did. However, the behavior is similar:

Size: 100                                                                                                                                                                                                                                                                    
Number of devices: 2                                                                                                                                                                                                                                                         
sum: 0.00 

I checked with nvtop that the GPU is used. Because there are two GPUs, I can even switch the device, and nvtop shows the change.

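Solution: The fix turned out to be simple. Changing map(from: data1_ptr[0:size]) to map(tofrom: data1_ptr[0:size]) did the trick. With map(from: ...), the host values are not copied to the device on entry, so the kernel read uninitialized device memory. The array has to be mapped with to or tofrom even though I never write to it.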

Andi
GCC 9 is a bit old for this, though it looks fine for this specific program. See the [GOMP support](https://gcc.gnu.org/projects/gomp/) of OpenMP. Do you get the same result with GCC 12? By the way, setting `sum` to -1 is non-conforming to the OpenMP standard. It must be initialized to 0. Besides, note that double-precision computation is generally very slow on client-side GPUs (it is OK only on expensive server-side GPUs). Also note that `j` is unused in the loop, so a clever compiler could optimize it out completely. Also, did you try to run a basic HSA/OpenCL example (e.g. hello world)? – Jérôme Richard Aug 11 '22 at 12:03
