I have a question about OpenCL programming. As I understand it, a quarter of a wavefront can issue an instruction each clock cycle, so issuing a whole wavefront takes four clock cycles. On the VLIW architecture, an instruction needs eight clock cycles to finish, so running a second wavefront is the solution: two wavefronts together cover eight clock cycles. Wavefront A executes first (4 clock cycles), then wavefront B executes (another 4 clock cycles). Once wavefront B has executed (8 clock cycles in total), wavefront A executes again with its next instruction.
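The interleaving described above can be written out as a small schedule. This is only a sketch of the model as stated in this post (one quarter-wavefront per cycle, wavefronts taking turns), not a description of real hardware; the function name `issue_schedule` and the wavefront labels "A" and "B" are made up for illustration.

```python
QUARTERS_PER_WAVEFRONT = 4  # assumption from the post: one quarter-wavefront issues per cycle


def issue_schedule(wavefronts, instructions_per_wavefront):
    """Return (cycle, wavefront, instruction_index, quarter) tuples.

    Wavefronts take turns: each one issues all four quarters of its
    current instruction before the next wavefront gets its turn.
    """
    schedule = []
    cycle = 1
    for instr in range(instructions_per_wavefront):
        for wf in wavefronts:
            for quarter in range(QUARTERS_PER_WAVEFRONT):
                schedule.append((cycle, wf, instr, quarter))
                cycle += 1
    return schedule


schedule = issue_schedule(["A", "B"], instructions_per_wavefront=2)
for cycle, wf, instr, quarter in schedule:
    print(f"cycle {cycle:2d}: wavefront {wf}, instruction {instr}, quarter {quarter}")
```

In this model, wavefront A occupies cycles 1 to 4, wavefront B occupies cycles 5 to 8, and wavefront A comes back with its next instruction in cycle 9.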
My question is:
How can the ALUs execute another instruction if all four ALUs in each processing element are already busy executing an earlier instruction?
For example:
In cycle 1, work-items 0-15 begin executing an "ADD" instruction.
The first ALU in each processing element (16 PEs in total per SIMD / compute unit) computes the "ADD" instruction.
The same happens in cycles 2, 3, and 4 for the rest of the wavefront, so now all 4 ALUs in each PE are busy executing the "ADD" instruction.
In cycle 5, a quarter of wavefront 2 begins executing a "SUBTRACT" instruction.
How can the ALUs in each processing element compute this instruction while they are still busy computing the "ADD" instruction from the first wavefront? (Remember that the "ADD" started in cycle 1 for the first quarter of the first wavefront is still unfinished, since it takes 8 clock cycles.)
Update: the 8 clock cycles refer to the read-after-write latency.
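With the 8 cycles read as read-after-write latency, the arithmetic behind "two wavefronts hide the latency" can be sketched as follows. This assumes only the numbers given in this post (4 issue cycles per wavefront, 8 cycles before a result can be read back); the helper name `stall_cycles` is hypothetical.

```python
ISSUE_CYCLES_PER_WAVEFRONT = 4  # from the post: a quarter-wavefront per cycle
RAW_LATENCY = 8                 # from the post: read-after-write latency in cycles


def stall_cycles(num_wavefronts):
    """Cycles a wavefront would have to wait before a dependent instruction.

    With wavefronts taking turns, the same quarter of the same wavefront
    comes around again every 4 * num_wavefronts cycles. If that gap is
    shorter than the read-after-write latency, the difference is a stall.
    """
    gap = ISSUE_CYCLES_PER_WAVEFRONT * num_wavefronts
    return max(0, RAW_LATENCY - gap)


for n in (1, 2, 3):
    print(f"{n} wavefront(s): {stall_cycles(n)} stall cycle(s)")
```

Under these assumptions a single wavefront would wait 4 cycles between dependent instructions, while two or more wavefronts leave no wait.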