So recently, AMD launched their new GPU architecture, RDNA, in their new Navi GPU lineup. After reading a few architecture deep-dive articles and watching some videos, my understanding is this (feel free to correct me if I am wrong):
Small workloads that need to execute the same instruction are called "threads".
The scheduler then groups together a bunch of those threads that require the same instruction. In AMD's case, GCN and RDNA are designed to process groups of 64 and 32 threads respectively.
The SIMD units then process those grouped threads. The difference is that AMD GCN uses SIMD16, meaning 16 threads can be processed at once, while AMD RDNA uses SIMD32, meaning 32 threads can be processed at once.
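To make the grouping step concrete, here is a minimal sketch of how I picture it, assuming threads are simply packed into fixed-size groups in order. The function name `split_into_groups` is made up for illustration and is not an AMD API.

```python
def split_into_groups(n_threads, group_size):
    """Pack n_threads into groups of group_size (hypothetical model).

    Returns a list with the number of active threads in each group;
    only the last group can be partially filled.
    """
    if n_threads < 0 or group_size <= 0:
        raise ValueError("n_threads must be >= 0 and group_size must be > 0")
    full_groups, leftover = divmod(n_threads, group_size)
    groups = [group_size] * full_groups
    if leftover:
        groups.append(leftover)
    return groups


# 100 threads on a GCN-style design (groups of 64)
print(split_into_groups(100, 64))  # [64, 36]
# 100 threads on an RDNA-style design (groups of 32)
print(split_into_groups(100, 32))  # [32, 32, 32, 4]
```

With the smaller group size, the partially filled last group wastes fewer slots (28 empty instead of 28 as well here, but out of a smaller group), which is the effect I am trying to understand.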
Things should work flawlessly if the GPU has all 64 threads to execute, but it would be a pain in the ass if it only needs to execute 1 thread. In that case only 1 SIMD16 vector unit is actually doing something productive, while the other three are basically just chilling.
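Put as arithmetic, the waste I am describing looks like this. It is only a sketch of my own mental model, assuming every group reserves a full set of slots whether or not they are filled; `slot_utilization` is a hypothetical helper.

```python
def slot_utilization(active_threads, group_size):
    """Fraction of a group's thread slots doing useful work (hypothetical model)."""
    if not 0 <= active_threads <= group_size:
        raise ValueError("active_threads must be between 0 and group_size")
    return active_threads / group_size


# A single active thread in a 64-wide group (GCN-style)
print(slot_utilization(1, 64))   # 0.015625
# A single active thread in a 32-wide group (RDNA-style)
print(slot_utilization(1, 32))   # 0.03125
# A completely full group
print(slot_utilization(64, 64))  # 1.0
```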
The change in architecture means that, with SIMD32, the GPU can eliminate this potential bottleneck.
However, every one of those sources keeps saying "The SIMD16 design is better suited for computational workload"... This raised some questions for me:
1) Isn't the SIMD32 design just better than SIMD16 in every single way? If not, then what exactly is the advantage of SIMD16 in computational work anyway?
2) For each group of 64 threads, do the 4 SIMD16 units do the processing work simultaneously or serially? The reason I ask is that the video from Engadget depicted the process as serialized, while the video from Linus Tech Tips seems to hint it's parallel. This confused the hell out of me.
If everything is serial, then why doesn't AMD just go for SIMD64 or something?
If everything is parallel, then I honestly do not see the advantage of SIMD32 at all. On GCN, you have 4 SIMD16, and on RDNA, you have 2 SIMD32. If you process 1 thread on GCN with SIMD16, the time to run 1 SIMD16 should be equal to the time to run 4 SIMD16, because, again, they are parallel. Jumping to 2 SIMD32, the time to process 1 SIMD32 should be equal to the time to process 2 of them. In both cases, you still have potentially 63 unused thread slots. So what's the point exactly?
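Here is a rough sketch of the two readings I am trying to choose between. It assumes each SIMD lane handles one thread per cycle and ignores latency, scheduling, and memory entirely; both function names are hypothetical, and I am not claiming either one matches what the hardware really does.

```python
def cycles_serial(group_size, simd_width):
    """Serial reading: one SIMD works through the whole group,
    simd_width threads per cycle."""
    return -(-group_size // simd_width)  # ceiling division


def cycles_parallel(group_size, simd_width, num_simds):
    """Parallel reading: the group is spread across all SIMDs at once."""
    return -(-group_size // (simd_width * num_simds))  # ceiling division


# GCN-style: groups of 64 threads, 4 x SIMD16
print(cycles_serial(64, 16))       # 4
print(cycles_parallel(64, 16, 4))  # 1

# RDNA-style: groups of 32 threads, 2 x SIMD32
print(cycles_serial(32, 32))       # 1
print(cycles_parallel(32, 32, 2))  # 1
```

Under the serial reading the two designs differ (4 cycles versus 1 per instruction for one group), while under the parallel reading they come out the same, which is exactly why I cannot tell which picture is the right one.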
I know my understanding must be flawed at some point, so I would love a deep explanation. Thank you.