How is the #pragma omp simd directive translated for a GPU target device?
On a GPU, each core handles a separate thread. Threads are grouped in sets of 32 (a warp) and assigned to 32 cores, which execute a single instruction in lockstep. But SIMD is a sub-thread concept: a single core would need a vector register and the ability to process several chunks of data within the context of a single thread. A GPU core can't do that; each core handles its one thread in a scalar manner.
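For concreteness, here is a minimal sketch of the kind of loop I have in mind (the array names and sizes are just illustrative). The question is what the simd level of this combined construct becomes on the device:

```c
#include <stdio.h>

#define N 1024

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Offload to the GPU: teams/parallel presumably map to
       blocks/threads, but what does the simd clause map to? */
    #pragma omp target teams distribute parallel for simd \
                map(to: a, b) map(from: c)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[10] = %f\n", c[10]);
    return 0;
}
```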
Does this mean that the simd directive can't be translated for a GPU at all?
Or maybe each thread is treated as if it had a single SIMD lane?
Or maybe the SIMD iterations are spread across the entire warp of 32 threads (but then what about the memory access pattern)? See the sketch of this last option below.
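To make that last possibility concrete, here is a purely hypothetical sketch in plain host-side C (just enumerating the mapping, not what any compiler is documented to emit) of how 32 SIMD lanes could map onto the 32 threads of a warp:

```c
/* Hypothetical mapping: the 32 "SIMD lanes" become the 32 threads
 * of one warp. This loop only enumerates which iteration each
 * (warp, lane) pair would execute; on a real GPU the inner lane
 * loop would run in lockstep across 32 cores, not sequentially. */
for (int warp = 0; warp < n / 32; warp++) {
    for (int lane = 0; lane < 32; lane++) {
        int i = warp * 32 + lane;  /* iteration assigned to this lane */
        c[i] = a[i] + b[i];        /* consecutive lanes would touch
                                      consecutive elements, i.e. the
                                      accesses would be coalesced */
    }
}
```

If this is what happens, the memory access question might answer itself: consecutive lanes reading consecutive elements is exactly the coalesced pattern GPUs want. But I don't know whether compilers actually lower simd this way.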