Questions tagged [warp-scheduler]
12 questions
176
votes
2 answers
How do CUDA blocks/warps/threads map onto CUDA cores?
I have been using CUDA for a few weeks, but I have some doubts about the allocation of blocks/warps/threads.
I am studying the architecture from a didactic point of view (university project), so reaching peak performance is not my concern.
First of…

Daedalus
- 1,761
- 3
- 11
- 3
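A minimal sketch of the mapping asked about above, assuming a 1D block: within a block, consecutive groups of 32 threads (by linearized thread index) form the warps that the hardware then schedules onto an SM's cores. The kernel below is illustrative and not taken from the question; device printf needs compute capability 2.0 or later.

#include <cstdio>

__global__ void show_warp_mapping()
{
    int linear_tid = threadIdx.x;             // 1D block assumed for simplicity
    int warp_id = linear_tid / warpSize;      // which warp of the block this thread is in
    int lane_id = linear_tid % warpSize;      // position of the thread inside its warp
    if (lane_id == 0)
        printf("block %d: warp %d starts at thread %d\n", blockIdx.x, warp_id, linear_tid);
}

// Example launch: 2 blocks of 128 threads -> 4 warps per block.
// show_warp_mapping<<<2, 128>>>();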
7
votes
1 answer
Why are there two warp schedulers in an SM of a GPU?
I read the NVIDIA Fermi whitepaper and got confused when I calculated the number of SP cores and schedulers.
According to the whitepaper, in each SM, there are two warp schedulers and two instruction dispatch units, allowing two warps to be issued and…

Dongwei Wang
- 475
- 5
- 14
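A back-of-the-envelope sketch of the Fermi numbers the question works from (32 CUDA cores and two warp schedulers per SM): each scheduler feeds a group of 16 cores, so one warp instruction occupies its group for two clocks, which is why two schedulers are needed to keep all 32 cores busy. The figures are the whitepaper values quoted in the question; the code just restates the arithmetic.

#include <stdio.h>

int main(void)
{
    int cores_per_sm    = 32;                        // Fermi GF100 SM
    int schedulers      = 2;                         // two warp schedulers per SM
    int cores_per_group = cores_per_sm / schedulers; // 16 cores fed by each scheduler
    int warp_size       = 32;
    int clocks_per_warp = warp_size / cores_per_group; // a warp instruction takes 2 clocks
    printf("each warp instruction occupies a %d-core group for %d clocks,\n",
           cores_per_group, clocks_per_warp);
    printf("so %d schedulers are needed to keep all %d cores busy\n",
           schedulers, cores_per_sm);
    return 0;
}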
5
votes
2 answers
blocks, threads, warpSize
There has been much discussion about how to choose the #blocks & blockSize, but I am still missing something. Many of my concerns relate to this question: How do CUDA blocks/warps/threads map onto CUDA cores? (To simplify the discussion, there is enough…

Doug
- 2,783
- 6
- 33
- 37
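One hedged, practical way to choose a block size for the question above is to let the runtime propose one with the occupancy API; the suggestion is typically a multiple of warpSize. my_kernel here is a placeholder, not code from the question.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float* data) { /* placeholder body */ }

int main()
{
    int min_grid_size = 0, block_size = 0;
    // Ask the runtime which block size maximizes occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, my_kernel, 0, 0);
    printf("suggested block size: %d (typically a multiple of the 32-thread warp size)\n",
           block_size);
    return 0;
}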
4
votes
1 answer
CUDA Warps and Thread Divergence
I am trying to understand CUDA warps and thread divergence. Suppose I have a naive matrix multiplication kernel to multiply n x n matrices.
__global__ void matrix_multiply(float* a, float* b, float* c, int n)
{
int row = blockIdx.y + blockDim.y…

csnate
- 1,601
- 4
- 19
- 31
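For reference, a common complete form of the naive kernel whose excerpt is truncated above; the index computation and boundary check are my guesses at the usual pattern, so the question's actual code may differ. The final if is where divergence can appear, and only in warps that straddle the matrix edge.

__global__ void matrix_multiply(float* a, float* b, float* c, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {           // out-of-range threads skip the loop, so
        float sum = 0.0f;               // divergence occurs only in boundary warps
        for (int k = 0; k < n; ++k)
            sum += a[row * n + k] * b[k * n + col];
        c[row * n + col] = sum;
    }
}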
3
votes
1 answer
cuda: warp divergence overhead vs extra arithmetic
Of course, warp divergence, via if and switch statements, is to be avoided at all costs on GPUs.
But what is the overhead of warp divergence (scheduling only some of the threads to execute certain lines) vs. additional useless arithmetic?
Consider…

cmo
- 3,762
- 4
- 36
- 64
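A hedged illustration of the trade-off asked about above: the two device functions below compute the same result, one with a branch that can diverge within a warp and one with branch-free (slightly redundant) arithmetic. Neither is from the question, and in practice the compiler often predicates short branches anyway, so measuring both is the only reliable answer.

// Branching version: threads in one warp may take different paths.
__device__ float clamp_branch(float x)
{
    if (x < 0.0f)
        return 0.0f;
    return x;
}

// Branch-free version: every thread does the same extra arithmetic.
__device__ float clamp_arith(float x)
{
    return x * (x >= 0.0f);   // the comparison contributes 0 or 1
}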
1
vote
1 answer
In an NVIDIA GPU, can ld/st and arithmetic instructions (such as INT32, FP32) run simultaneously in the same SM?
Especially on the Turing and Ampere architectures: within the same SM and the same warp scheduler, can warps run ld/st and other arithmetic instructions simultaneously?
I want to know how the warp scheduler works

sorfkc
- 13
- 4
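A sketch of the kind of kernel where this question matters, not code from it: the load and the fused multiply-add below are independent across threads, so a scheduler can keep the load/store unit and the FP32 pipe occupied with instructions from different warps; whether they issue in the same cycle depends on the architecture's dispatch width.

__global__ void axpy_overlap(const float* __restrict__ x, float* y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xv = x[i];          // goes to the load/store pipe
        y[i] = a * xv + y[i];     // FFMA goes to the FP32 pipe once the loads complete
    }
}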
1
vote
1 answer
Questions about resident warps in CUDA
I have been using CUDA for a month, and now I'm trying to work out how many warps/blocks are needed to hide the latency of memory accesses. I think it is related to the maximum number of resident warps on a multiprocessor.
According to Table 13 in…

Falofter
- 41
- 3
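A rough, hedged version of the latency-hiding arithmetic behind this question: warps needed ≈ latency × issue rate per scheduler, assuming each warp stalls for the full latency after issuing one instruction. The latency and issue-rate values below are placeholders, not the figures from the table the question cites.

#include <stdio.h>

int main(void)
{
    double latency_cycles   = 400.0;  // placeholder global-memory latency
    double issues_per_cycle = 1.0;    // placeholder issue rate per scheduler
    double warps_needed = latency_cycles * issues_per_cycle;
    printf("roughly %.0f resident warps with independent work per scheduler\n",
           warps_needed);
    return 0;
}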
1
vote
1 answer
cuda shared memory and block execution scheduling
I would like to clear up how block execution is scheduled based on the amount of shared memory used per block.
State
I am targeting an NVIDIA GTX480 card, which has 48KB of shared memory per block and 15 streaming…

zeus2
- 309
- 2
- 11
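A small hedged calculation of the scheduling effect this question is about: with 48KB of shared memory available per SM, the shared memory each block requests caps how many blocks can be resident on an SM at once (register use and the hardware's resident-block limit can cap it further). The 20KB per-block figure is a made-up example.

#include <stdio.h>

int main(void)
{
    int smem_per_sm    = 48 * 1024;   // Fermi-class shared memory per SM
    int smem_per_block = 20 * 1024;   // example request per block
    int blocks_by_smem = smem_per_sm / smem_per_block;   // = 2 resident blocks
    printf("shared memory alone limits this kernel to %d concurrent blocks per SM\n",
           blocks_by_smem);
    return 0;
}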
0
votes
1 answer
How can a warp cause another warp to be in the idle state?
As you can see in the title of the question, I want to know how a warp causes another warp to go to the idle state. I have read a lot of Q&A on SO but I cannot find the answer. At any time, can just one warp in a block run? If so, the idle state…

Saeed Rahmani
- 650
- 1
- 8
- 29
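One concrete way a warp can leave other warps idle, sketched below as an illustrative kernel (not from the question): at a __syncthreads() barrier, every warp that arrives early sits idle until the slowest warp of the block, here the one doing extra work, also reaches it.

__global__ void barrier_idle_demo(float* out)
{
    int warp_id = threadIdx.x / warpSize;
    float v = 1.0f;
    if (warp_id == 0) {
        // Warp 0 does extra work, so the block's other warps wait at the barrier.
        for (int i = 0; i < 10000; ++i)
            v = v * 1.000001f + 0.000001f;
    }
    __syncthreads();              // early warps are idle here until warp 0 arrives
    out[threadIdx.x] = v;
}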
0
votes
1 answer
Is there a way to explicitly map a thread to a specific warp in CUDA?
Say, dynamic analysis was done on a CUDA program such that certain threads were better off being in the same warp.
For example, let's pretend we have 1024 CUDA threads and a warp size of 32. After dynamic analysis we find out that threads 989, 243,…

xfern
- 96
- 1
- 7
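There is no launch-time knob for this: within a block, warp membership is fixed by the linear thread index (threads 0-31 form warp 0, 32-63 form warp 1, and so on). The usual workaround, sketched below, is to leave the threads where they are and remap the work instead; remap is a hypothetical permutation produced by the dynamic analysis, not something from the question.

__global__ void remapped_work(const int* __restrict__ remap, const float* in, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // warp membership is fixed: tid / 32
    if (tid < n) {
        int work_item = remap[tid];              // work items that should share a warp are
        out[work_item] = in[work_item] * 2.0f;   // placed in adjacent slots of remap
    }
}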
0
votes
1 answer
CUDA Kepler: not enough ALUs
According to the Kepler whitepaper, the warp size for a Kepler-based GPU is 32 and each multiprocessor contains 4 warp schedulers, each of which selects two independent instructions from a chosen warp. This means that each clock cycle, 32*4*2 = 256…

PieterV
- 816
- 10
- 23
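The arithmetic behind the question, restated as a sketch: 4 schedulers × 2 dispatch units × 32 threads gives 256 instruction slots per clock against 192 FP32 cores per SMX; the totals reconcile because dual-issued instructions can also target the load/store, SFU, and other pipes rather than all landing on the FP32 ALUs. The counts are the whitepaper figures quoted in the question.

#include <stdio.h>

int main(void)
{
    int warp_size          = 32;
    int schedulers         = 4;    // per Kepler SMX
    int dispatch_per_sched = 2;    // two dispatch units per scheduler
    int fp32_cores         = 192;  // per Kepler SMX
    int issue_slots = warp_size * schedulers * dispatch_per_sched;   // 256 per clock
    printf("issue slots per clock: %d, FP32 lanes: %d\n", issue_slots, fp32_cores);
    printf("the remaining %d slots can feed ld/st, SFU and other pipes\n",
           issue_slots - fp32_cores);
    return 0;
}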
0
votes
1 answer
What is the instruction issue time latency of the warp schedulers in CUDA?
I am under the impression that the (single) warp scheduler in compute capability 1.x GPUs issues one instruction per warp every 4 cycles, and since the latency of the arithmetic pipeline is 24 cycles, it can be completely hidden by having 6 active…

charis
- 429
- 6
- 16
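The arithmetic this excerpt refers to, spelled out as a sketch using the compute-capability-1.x figures mentioned in the question (4-cycle issue time per warp, 24-cycle arithmetic latency):

#include <stdio.h>

int main(void)
{
    int issue_cycles_per_warp = 4;   // cc 1.x: one instruction per warp every 4 cycles
    int arithmetic_latency    = 24;  // cycles before the result can be used
    int warps_to_hide = arithmetic_latency / issue_cycles_per_warp;   // = 6
    printf("%d active warps suffice to hide the arithmetic pipeline latency\n",
           warps_to_hide);
    return 0;
}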