On NVIDIA GPUs, can ld/st and arithmetic instructions (such as INT32 and FP32) run simultaneously in the same SM?

Question

Especially on the Turing and Ampere architectures: in the same SM and under the same warp scheduler, can warps run ld/st and other arithmetic instructions simultaneously?

I want to understand how the warp scheduler works.

score 2 · Answer 1 · answered Jan 17 '23 at 17:30

In the same SM and same warp scheduler, can the warps run ld/st and other arithmetic instructions simultaneously?

No, not if "simultaneously" means "issued in the same clock cycle".

In current CUDA GPUs, including Turing and Ampere, when the warp scheduler issues an instruction, it issues the same instruction to all threads in the warp, in any given clock cycle.

Different instructions could be run in different clock cycles (of course) and different instructions can be run in the same clock cycle, if those instructions are issued by different warp schedulers in the SM. This would also imply that those instructions are issued to distinct/separate SM units.

So, for example, an integer add instruction issued by warp scheduler 0 would have to be issued to separate functional units compared to a load/store instruction issued by warp scheduler 1 in the same SM. For this example, since the instructions are different, different functional units are needed anyway, and this is self-evident.

But even if both warp schedulers were issuing, for example, FADD (for 2 different warps), they would have to issue to separate floating-point functional units in the SM.

In modern CUDA GPUs, due to the partitioning of the SM, each warp scheduler has its own execution resources (functional units) for at least some instruction types, like FADD. So this would happen anyway, again, for this reason, in this example.
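
The issue rule described above can be sketched as a toy model. This is only an illustration, not a description of real hardware: the function name `simulate`, the modulo mapping of warps to schedulers, the instruction names, and the default of four schedulers per SM are all assumptions made for the example, and the model ignores latencies and data dependencies entirely.

```python
from collections import defaultdict

NUM_SCHEDULERS = 4  # assumption: four warp schedulers (partitions) per SM


def simulate(streams, num_schedulers=NUM_SCHEDULERS):
    """Toy issue model: each scheduler issues at most one instruction per cycle.

    streams maps warp_id -> list of instruction names (that warp's own stream).
    Returns a list of cycles; each cycle maps scheduler_id -> (warp_id, instruction).
    """
    # Static assignment of warps to schedulers (the modulo rule is an assumption).
    owned = defaultdict(list)
    for warp_id in sorted(streams):
        owned[warp_id % num_schedulers].append(warp_id)

    pc = {w: 0 for w in streams}  # index of the next instruction for each warp
    trace = []
    while any(pc[w] < len(streams[w]) for w in streams):
        cycle = {}
        for sched in range(num_schedulers):
            # Pick the first warp owned by this scheduler that still has work.
            for w in owned[sched]:
                if pc[w] < len(streams[w]):
                    cycle[sched] = (w, streams[w][pc[w]])
                    pc[w] += 1
                    break  # one instruction per scheduler per cycle
        trace.append(cycle)
    return trace


# Warps 0 and 4 share scheduler 0; warp 1 sits on scheduler 1.
trace = simulate({0: ["LDG", "IADD"], 1: ["FADD"], 4: ["FADD"]})
for cycle_no, issued in enumerate(trace):
    print(cycle_no, issued)
```

In this sketch, cycle 0 has scheduler 0 issuing a load for warp 0 while scheduler 1 issues an FADD for warp 1, so different instructions go out in the same cycle only because they come from different schedulers. Warp 4 shares scheduler 0 with warp 0, so its FADD has to wait for a later cycle.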

An arithmetic instruction first needs a ld instruction to load data from memory into a register, and only then can it execute. These instructions take several clock cycles to complete. So within one warp scheduler, for example scheduler 0 in SM 0 with warp0 and warp1, can the two warps use different units in one clock cycle? For example, warp0 executes a ld instruction to load data into a register, then scheduler 0 saves the context, waits for the data to arrive in the specified register, and switches to warp1 to execute another instruction, such as an arithmetic instruction. — sorfkc, Jan 18 '23 at 02:26
no, it doesn't work that way, currently. In modern GPUs, warps are statically assigned to schedulers, and warp0 has a separate instruction stream from warp1. All instructions from the instruction stream for warp 0 must be handled by warp 0. There is no "save context and switch" — Robert Crovella, Jan 18 '23 at 15:27
