
I am using an AMD Radeon HD 7700 GPU. I want to use the following kernel to verify that the wavefront size is 64.

// T is assumed to be defined at build time (e.g. with -D T=uint).
__kernel
void kernel__test_warpsize(
        __global T* dataSet,
        uint size
        )
{
    size_t idx = get_global_id(0);

    T value = dataSet[idx];
    if (idx < size - 1)
        dataSet[idx + 1] = value;
}

In the main program, I pass an array with 128 elements. The initial values are dataSet[i]=i. After the kernel, I expect the following values: dataSet[0]=0, dataSet[1]=0, dataSet[2]=1, ..., dataSet[63]=62, dataSet[64]=63, dataSet[65]=63, dataSet[66]=65, ..., dataSet[127]=126.
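
(The host code is not shown in the question; a minimal setup matching the description might look roughly like the sketch below. The names, the choice T = uint, and the helper function are assumptions, and error checking is omitted.)

#include <CL/cl.h>
#include <stdio.h>

/* Hypothetical host-side sketch; `context`, `queue` and `program` (built with
 * "-D T=uint") are assumed to have been created already. */
void run_test(cl_context context, cl_command_queue queue, cl_program program)
{
    enum { N = 128 };
    cl_uint host[N];
    for (cl_uint i = 0; i < N; ++i)
        host[i] = i;                          /* dataSet[i] = i */

    cl_mem buf = clCreateBuffer(context,
                                CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(host), host, NULL);

    cl_kernel kernel = clCreateKernel(program, "kernel__test_warpsize", NULL);
    cl_uint size = N;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kernel, 1, sizeof(cl_uint), &size);

    size_t global = N;                        /* one work-item per element */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* Blocking read: copy the results back and print them. */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(host), host, 0, NULL, NULL);
    for (cl_uint i = 0; i < N; ++i)
        printf("dataSet[%u] = %u\n", i, host[i]);

    clReleaseMemObject(buf);
    clReleaseKernel(kernel);
}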

However, I found that dataSet[65] is 64, not 63, which does not match my expectation.

My understanding is that the first wavefront (64 threads) should change dataSet[64] to 63. So when the second wavefront is executed, thread #64 should read 63 and write it to dataSet[65]. But I see dataSet[65] is 64. Why?

redpearl
  • You should not be trying to verify the warp or wavefront size. If you write code that tests for warp sizes of 32 and 64, what happens when the device you use has a warp size of 8, 16, or perhaps 48 for that matter? What happens if Nvidia or AMD change their warp/wavefront sizes? If you are trying to find optimal work-group sizes, the best solution is to write a mini benchmark that tests all sensible configurations (or at least a sufficient subset). – chippies Nov 10 '13 at 12:48

2 Answers


You are invoking undefined behaviour. If you wish to access memory that another work-item in a work-group is writing, you must use barriers.

In addition, assume that the GPU runs 2 wavefronts at once. Then dataSet[65] being 64 is indeed a valid result: the first wavefront has simply not completed (and written dataSet[64] = 63) by the time the second wavefront reads dataSet[64].

Even all items coming out as 0 is a valid result according to the spec, because everything could also be performed completely serially (each work-item in turn reading the 0 its predecessor just wrote and propagating it forward). That's why you need the barriers.
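
(For illustration, a variant of the kernel that makes the ordering defined within each work-group could look like the sketch below; this is not from the original answer. Note that a barrier only synchronizes work-items within the same work-group, so ordering across work-group boundaries remains undefined.)

// Sketch: every work-item reads its element, then a work-group-wide barrier
// guarantees all reads have happened before any work-item overwrites its
// neighbour's element.
__kernel
void kernel__shift_with_barrier(
        __global T* dataSet,
        uint size
        )
{
    size_t idx = get_global_id(0);

    T value = (idx < size) ? dataSet[idx] : (T)0;

    // All work-items in this work-group wait here; global memory reads
    // issued before the barrier are complete before anyone proceeds.
    barrier(CLK_GLOBAL_MEM_FENCE);

    if (idx < size - 1)
        dataSet[idx + 1] = value;
}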

Based on your comments, I edited this part:

Install CodeXL: http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
Read the AMD Accelerated Parallel Processing OpenCL Programming Guide: http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

Optimizing branching within a certain number of threads is only a small part of optimization. You should read up on how AMD hardware schedules the wavefronts within a work-group and how it hides memory latency by interleaving the execution of wavefronts (within a work-group). Branching also affects the execution of the whole work-group, because the effective time to run it is basically the time to execute the single longest-running wavefront (the device cannot free local memory etc. until everything in the group is finished, so it cannot schedule another work-group there). This also depends on your local memory and register usage, among other things. To see what actually happens, just grab CodeXL and run a GPU profiling session. That will show exactly what happens on the device.
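
(As a hypothetical illustration of the branching point, and of the if(get_global_id(0) == 0) case mentioned in the comments below, a kernel like this sketch is the kind of thing to avoid; it is my example, not from the question.)

// Hypothetical kernel: one work-item runs a long loop while the rest return
// immediately. The slow wavefront determines when the whole work-group can
// retire and free its registers/local memory, so the other wavefronts'
// resources stay occupied until it finishes.
__kernel
void kernel__divergence_cost(__global uint* out, uint iterations)
{
    if (get_global_id(0) == 0) {
        uint acc = 0;
        for (uint i = 0; i < iterations; ++i)
            acc += i * i;
        out[0] = acc;
    }
    // all other work-items simply return
}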

And even this applies only to the current generation of hardware. That's why the concept is not in the OpenCL specification itself: these properties change a lot and depend heavily on the hardware.

But if you really want to know what the AMD wavefront size is, the answer is pretty much always 64 (see http://devgurus.amd.com/thread/159153 for a reference to their OpenCL programming guide). It is 64 for all GCN devices, which make up their whole current lineup. Maybe some older devices have 16 or 32, but right now everything is just 64 (for NVIDIA it is generally 32).
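
(If you want to read that number out programmatically instead of hard-coding 64, OpenCL 1.1+ lets you query CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, which reports the wavefront/warp size on AMD and NVIDIA hardware. A minimal sketch, assuming the kernel and device handles already exist and ignoring error codes:)

#include <CL/cl.h>
#include <stdio.h>

/* Sketch: report the granularity the runtime recommends rounding work-group
 * sizes to (64 on GCN, 32 on current NVIDIA GPUs). */
void print_wavefront_size(cl_kernel kernel, cl_device_id device)
{
    size_t multiple = 0;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(multiple), &multiple, NULL);
    printf("preferred work-group size multiple: %zu\n", multiple);
}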

sharpneli
  • Thanks for your reply! I understand that using a barrier will force synchronization. What I want is just to avoid the barrier (for performance reasons) by utilizing the property that all threads in the same wavefront execute the same instructions. However, the result is not as expected. – redpearl Nov 11 '13 at 20:13
  • About "get_local_size(0)": I think it should return the number of threads in a workgroup (which is set by me), not the size of a wavefront. Right? For example, if I set the workgroup size to 512, get_local_size(0) will return 512, not 64. – redpearl Nov 11 '13 at 20:16
  • If you set it to 512 it will almost certainly fail, the spec doesn't require implementations to support arbitrary local sizes. In AMD HW the local size is exactly the wavefront size. Same applies to Nvidia. In general you don't really need to care how the implementation will handle it. – sharpneli Nov 11 '13 at 20:29
  • Yes, 512 will fail because CL_KERNEL_WORK_GROUP_SIZE returns 256. However, I can set the workgroup size to 256. Does that mean the wavefront size is 256? Do you have any official link to support your claim that "the local size is exactly the wavefront size"? – redpearl Nov 11 '13 at 21:23
  • I edited the main answer now that I have a better view of why you want to know the wavefront size. It is indeed 64 in the sense of how many threads execute instructions in lockstep; however, it is not the number of threads a single processor executes at once. As an example, if you have if(get_global_id(0) == 0) do_something_massive; else return; you will effectively stall 256 threads (or whatever number of threads forms that workgroup, which is then composed of wavefronts). – sharpneli Nov 11 '13 at 22:15

CUDA model - what is warp size? I think this is a good answer that explains the warp briefly.

But I am a bit confused about what sharpneli said: "If you set it to 512 it will almost certainly fail, the spec doesn't require implementations to support arbitrary local sizes. In AMD HW the local size is exactly the wavefront size. Same applies to Nvidia. In general you don't really need to care how the implementation will handle it."

I think the local size, which means the work-group size, is set by the programmer. But when the kernel actually executes, the group is subdivided by the hardware into units like warps.
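
(Illustrative sketch of that split of responsibilities: the local size is chosen by the programmer within the limits the runtime reports, and the hardware then divides each work-group into warps/wavefronts on its own. The handles `kernel`, `device` and `queue` are assumed to exist, and error checking is omitted.)

/* Query the upper limit for this kernel's work-group size (e.g. 256),
 * then choose a local size within it. */
size_t max_wg = 0;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_wg), &max_wg, NULL);

size_t global = 128;
size_t local  = 64;   /* programmer's choice; must not exceed max_wg */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
/* On GCN a 64-item work-group runs as one wavefront; a 256-item
 * work-group would run as four wavefronts of 64. */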

zac