
I want to ask something about OpenCL programming. I understand that a quarter of a wavefront can issue an instruction each clock cycle, so it takes four clock cycles to issue one instruction for a whole wavefront. On the VLIW architecture, it takes eight clock cycles for the instruction to finish, so scheduling another wavefront is the solution: with two wavefronts the total is eight clock cycles. After wavefront A issues (4 clock cycles), wavefront B issues (another 4 clock cycles). After wavefront B has issued (8 clock cycles total), wavefront A executes its next instruction.

The question is:

How can the ALUs execute another instruction if the four ALUs in each processing element are already busy executing the previous instruction?

For example: in cycle 1, work items 0-15 begin to execute an "ADD" instruction; the first ALU in each processing element (16 PEs per SIMD / compute unit) computes the "ADD".
The same happens in cycles 2, 3, and 4 for the rest of the wavefront (so now 4 ALUs in each PE are busy executing the "ADD" instruction). In cycle 5, a quarter of wavefront 2 begins to execute a "SUBTRACT" instruction. How can the ALUs in each processing element execute it when they are still busy with the "ADD" from the first wavefront? (Remember that the "ADD" issued for the first quarter-wavefront in cycle 1 is still unfinished, since it takes 8 clock cycles.)

Update: the 8 clock cycles refer to the read-after-write latency.
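The interleaving I am describing can be sketched as a simple cycle timeline (a rough sketch of my mental model of the scheduling, not anything from vendor documentation):

```python
def issue_slot(cycle):
    """Which wavefront and which quarter-wavefront issue on a given cycle (1-based).
    Assumes two wavefronts, A and B, alternating every 4 cycles."""
    wavefront = "A" if (cycle - 1) // 4 % 2 == 0 else "B"
    quarter = (cycle - 1) % 4 + 1
    return wavefront, quarter

# Cycles 1-4 issue wavefront A, cycles 5-8 issue wavefront B;
# by cycle 9 the result of A's first instruction (8-cycle latency) is ready.
for cycle in range(1, 9):
    wf, q = issue_slot(cycle)
    print(f"cycle {cycle}: wavefront {wf}, quarter-wavefront {q}")
```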

arvin99
  • http://en.wikipedia.org/wiki/CPU_pipeline – Oliver Charlesworth Dec 27 '13 at 02:21
  • Sorry, 8 clock cycles means the read-after-write latency. I have already read the Wikipedia article and understand the concept, but if one instruction takes eight cycles, how can the ALUs execute instructions from the second wavefront (cycles 5-8) if all the ALUs in each processing element are still executing instructions from the first wavefront (cycles 1-4), which take 8 clock cycles to complete? – arvin99 Dec 27 '13 at 05:41

1 Answer


As you have stated, it takes 4 clock cycles for a wavefront's instruction to be issued. The results of that instruction are sent to the registers, but because of the read-after-write latency they only become readable after 8 clock cycles. The important distinction here is that the ALUs finish their work in 4 cycles, so they can go on processing other instructions; it is the register file that takes 8 cycles to do its job, i.e. store the new data and make it visible again.

As a general note for all types of memory accesses, including registers: memory accesses are handled differently from ordinary arithmetic, and the ALUs can continue executing instructions that don't depend on the results of a memory access while waiting for it to finish.
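To make the timing concrete, here is a small round-robin scheduling sketch (a hypothetical model written to illustrate the numbers above, not AMD's actual hardware scheduler). It assumes each instruction in a wavefront reads the result of the previous one, so a wavefront cannot issue again until its last result is readable:

```python
ISSUE_CYCLES = 4   # cycles to issue one instruction for a full wavefront
RAW_LATENCY = 8    # cycles until that instruction's result can be read back

def schedule(num_wavefronts, num_instructions):
    """Round-robin the wavefronts on one SIMD; return (timeline, stall_cycles).
    Each instruction is assumed to depend on the previous one in its wavefront."""
    timeline = []
    stalls = 0
    ready_at = [0] * num_wavefronts   # cycle when each wavefront's last result is ready
    done = [0] * num_wavefronts       # instructions issued per wavefront
    cycle = 0
    wf = 0
    while min(done) < num_instructions:
        if done[wf] < num_instructions:
            if cycle < ready_at[wf]:              # previous result not ready: stall
                stalls += ready_at[wf] - cycle
                cycle = ready_at[wf]
            timeline.append((cycle, wf, done[wf]))
            ready_at[wf] = cycle + RAW_LATENCY
            cycle += ISSUE_CYCLES                 # ALUs are busy only these 4 cycles
            done[wf] += 1
        wf = (wf + 1) % num_wavefronts
    return timeline, stalls

print(schedule(1, 2)[1])  # → 4: one wavefront stalls waiting on the register file
print(schedule(2, 2)[1])  # → 0: two wavefronts completely hide the 8-cycle latency
```

With one wavefront the SIMD idles for 4 of every 8 cycles; with two, the second wavefront's 4 issue cycles exactly fill the gap, which is the scheme the AMD guide describes.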

chippies
    What kind of register takes four clock cycles to propagate the input to the output? – John Dvorak Dec 28 '13 at 11:55
  • Apparently, AMD's. This is my interpretation of section 7.6.1 of the AMD Accelerated Parallel Processing OpenCL Programming Guide revision 2.7 (November 2013). Nvidia is similar according to section 9.2.6 of the CUDA 5.5 best practices guide. – chippies Dec 28 '13 at 13:43
  • Can you quote it here? I can't access PDFs now. – John Dvorak Dec 28 '13 at 14:02
    Part 1 of 2: 7.6.1 Hiding ALU and Memory Latency The read-after-write latency for most arithmetic operations (a floating-point add, for example) is only eight cycles. For most AMD GPUs, each compute unit can execute 16 VLIW instructions on each cycle. Each wavefront consists of 64 workitems; each compute unit executes a quarter-wavefront on each cycle, and the entire wavefront is executed in four consecutive cycles. Thus, to hide eight cycles of latency, the program must schedule two wavefronts. – chippies Dec 29 '13 at 14:14
    Part 2 of 2: The compute unit executes the first wavefront on four consecutive cycles; it then immediately switches and executes the other wavefront for four cycles. Eight cycles have elapsed, and the ALU result from the first wavefront is ready, so the compute unit can switch back to the first wavefront and continue execution. Compute units running two wavefronts (128 threads) completely hide the ALU pipeline latency. – chippies Dec 29 '13 at 14:15
    These two parts are the 1st paragraph of section 7.6.1 of the AMD OpenCL Programming Guide. – chippies Dec 29 '13 at 14:16