
I was wondering how to interpret the following OpenMP constructs:

#pragma omp target teams distribute parallel for
for(int i = 0; i < N; ++i) {
    // compute
}

#pragma omp target teams distribute parallel for simd
for(int i = 0; i < N; ++i) {
    // compute
}

Note the simd added to the second directive. According to the OpenMP 5.1 specification, the simd construct declares that "multiple iterations of the loop can be executed concurrently by using SIMD instructions".

I believe I have a good idea of how simd is implemented and behaves on a CPU, but on a GPU, more precisely on AMD GPUs, there is no exposed SIMD instruction as such, in the sense that a HIP thread is in fact a SIMD instruction lane.

According to the OpenMP specification, if there is a loop-carried dependency, or if the compiler cannot prove that there is none, then when OpenMP maps the teams to thread blocks/workgroups and the threads to SIMD lanes, it is forced to use thread blocks of only one thread.
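
To make the dependency case concrete, here is a made-up loop (not from my real code; the function name and the dependence distance of 4 are my own assumptions) where the safelen clause tells the compiler how many iterations may safely run in lockstep as SIMD lanes:

#include <cstddef>

void shift_by_four(double* a, std::size_t length) {
    // Loop-carried dependency of distance 4: iteration i reads what
    // iteration i - 4 wrote, so at most 4 iterations may run concurrently.
#pragma omp target teams distribute parallel for simd safelen(4)
    for (std::size_t i = 4; i < length; ++i) {
        a[i] = a[i - 4] + 1.0;
    }
}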

How do you interpret the target teams distribute parallel for simd construct:

  • Does it mean that, in this context, simd cannot be translated for a GPU?
  • Or is each thread handled as if it had a single SIMD lane? (See the sketch below.)
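
To make the second interpretation concrete, here is the kind of hand-written HIP kernel I would expect such a loop to lower to (a hypothetical sketch of my own; the names are mine, not compiler output): each team becomes a workgroup/thread block, each OpenMP thread a work-item, and the simd level collapses to a single lane per work-item.

#include <cstddef>
#include <hip/hip_runtime.h>

// Hypothetical hand-written equivalent of the offloaded loop above.
__global__ void compute_kernel(double* a, const double* b, std::size_t N) {
    // One loop iteration per work-item: team -> block, thread -> work-item,
    // and the "simd" level degenerates to a single lane per work-item.
    const std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
    if (i < N) {
        a[i] = b[i] * b[i]; // placeholder for "// compute"
    }
}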

There is at least one similar but old and unanswered question: How is omp simd for loop executed on GPUs?

Etienne M
  • Could it be that it is actually ignored? Have you tried running with and without the `simd` switch? Is there any difference in the compiled results or in the performance? – Fra93 Nov 10 '22 at 15:45
  • @Fra93 Actually, I have tried it, and in practice, on loops like the one presented above (potentially with a collapse), I did not see any performance degradation or improvement. I used the amdclang and HPE Cray compilers. Some months ago, there was no viable support for the simd clause combined with GPU offloading on the Cray compiler (it forced one thread per thread block!). – Etienne M Nov 10 '22 at 15:49

1 Answer


According to the test case below, the assembly generated for an AMD MI250 (gfx90a) is the same with or without simd. However, if you look at the CPU code, you will see a significant change with the simd clause, which in this case allows an optimization similar to the one obtained with an explicit use of the restrict keyword.

TL;DR: currently, the simd clause makes no difference for the GPU target and only leads to this warning, even for extremely trivial cases: "loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]".

#include <cstddef>
#include <cstdint>

#define RESTRICT __restrict

using Float = double;

void test0_0(Float* a, const Float* b) {
    a[0] = b[0] * b[0];
    // Forced store/reload (b[0] could be a[0]).
    a[1] = b[0];
}

void test0_1(Float* RESTRICT a, const Float* RESTRICT b) {
    a[0] = b[0] * b[0];
    // No forced store/reload.
    a[1] = b[0];
}

void test1_0(Float* a, Float* b, std::size_t length) {
#pragma omp parallel for
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // Forced store/reload
        a[i + 1] = b[i + 0];
    }
}

void test1_1(Float* a, Float* b, std::size_t length) {
#pragma omp parallel for simd
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // simd -> no loop carried dependencies:
        // No forced store/reload -> easier vectorization, less generated code.
        a[i + 1] = b[i + 0];
    }
}

void test2_0(Float* a, Float* b, std::size_t length) {
#pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // ASM shows forced store/reload, as expected.
        a[i + 1] = b[i + 0];
    }
}

void test2_1(Float* RESTRICT a, Float* RESTRICT b, std::size_t length) {
#pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // ASM still shows a forced store/reload even though a/b are restrict-qualified: BAD!
        a[i + 1] = b[i + 0];
    }
}

void test3_0(Float* a, const Float* b, std::size_t length) {
#pragma omp target teams distribute parallel for simd
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // ASM still shows a forced store/reload even though simd declares no loop-carried dependencies: BAD!
        a[i + 1] = b[i + 0];
    }
}

void test3_1(Float* RESTRICT a, const Float* RESTRICT b, std::size_t length) {
#pragma omp target teams distribute parallel for simd
    for (std::size_t i = 0; i < length; i += 2) {
        a[i + 0] = b[i + 0] * b[i + 0];
        // ASM still shows a forced store/reload even though a/b are restrict-qualified and simd is used: BAD!
        a[i + 1] = b[i + 0];
    }
}

Code available at: https://godbolt.org/z/sMY48s8jz
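
In case someone wants to reproduce this outside of Compiler Explorer: with the ROCm toolchain, an invocation along the lines of the one below should enable OpenMP offloading for gfx90a (the exact flags are an assumption on my part, not taken from the linked setup).

amdclang++ -O3 -fopenmp --offload-arch=gfx90a -c test.cpp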

Etienne M