
Is there any relationship between blockIdx and the order in which thread blocks are executed on the GPU device?

My motivation is that I have a kernel in which multiple blocks will read from the same location in global memory, and it would be nice if these blocks would run concurrently (because L2 cache hits are nice). In deciding how to organize these blocks into a grid, would it be safe to say that blockIdx.x=0 is more likely to run concurrently with blockIdx.x=1 than with blockIdx.x=200? And that I should try to assign consecutive indices to blocks that read from the same location in global memory?

To be clear, I'm not asking about inter-block dependencies (as in this question) and the thread blocks are completely independent from the point of view of program correctness. I'm already using shared memory to broadcast data within a block, and I can't make the blocks any larger.

EDIT: Again, I am well aware that

Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series.

and the blocks are fully independent---they can run in any order and produce the same output. I am just asking if the order in which I arrange the blocks into a grid will influence which blocks end up running concurrently, because that does affect performance via L2 cache hit rate.

KQS
  • From the [CUDA Programming Guide](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy): "Thread blocks are required to execute independently: It must be possible to execute them in _any order_, in parallel or in series." But you can use a global variable that you increment for every new block that just started and use that variable as your "block id". I am quite sure that there is at least one question about how to do this here on SO. – BlameTheBits Oct 10 '17 at 07:53
  • That's not what I'm asking, and I've edited the question to be more clear about this. I'm not trying to determine the actual order of execution, as you suggest; I'm asking whether the built-in variable `blockIdx` has any relation to the order in which blocks are issued for execution. – KQS Oct 10 '17 at 14:37
  • Many hardware features make me believe there is no way to predict any connection. For instance, there is no rule for scheduling a block on some particular sm, and it also depends on other processes, kernels that may be running concurrently, even probably memory operations. – Florent DUGUET Oct 10 '17 at 16:12
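For reference, a minimal sketch of the global block counter suggested in the first comment above (the kernel and variable names are illustrative): each block draws its own index from a device-wide counter with atomicAdd, so the indices reflect the order in which blocks actually start rather than whatever blockIdx the hardware hands out.

// A block that assigns itself an index in actual start order instead of
// relying on blockIdx. The counter lives in device memory and must be reset
// to 0 (e.g. with cudaMemcpyToSymbol) before every launch that uses it.
__device__ unsigned int block_counter = 0;

__global__ void orderedBlockKernel(/* your arguments */)
{
    __shared__ unsigned int my_block_id;
    if (threadIdx.x == 0)
        my_block_id = atomicAdd(&block_counter, 1);
    __syncthreads();

    // Use my_block_id instead of blockIdx.x when deciding which part of
    // global memory this block reads, so blocks that start close together
    // in time also touch the same (or nearby) locations.
}

Blocks that start together then get consecutive indices by construction, which is exactly the property the question is after.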

2 Answers


I found a writeup in which a CS researcher used micro-benchmarking to reverse engineer the block scheduler on a Fermi device:

http://cs.rochester.edu/~sree/fermi-tbs/fermi-tbs.html

I adapted his code to run on my GPU device (GTX 1080, with the Pascal GP104 GPU) and to randomize the runtimes.

Methods

Each block contains only 1 thread, and is launched with enough shared memory that only 2 blocks can be resident per SM. The kernel records its start time (obtained via clock64()) and then runs for a random amount of time (the task, appropriately enough, is generating random numbers using the multiply-with-carry algorithm).
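As a rough sketch (not the exact benchmark code: the kernel name, the MWC constants, and the 48 KB dynamic shared-memory request are illustrative stand-ins for whatever caps residency at 2 blocks per SM on your part), the kernel and its launch look something like this:

#include <stdio.h>

// Sketch of the timing kernel: 1 thread per block, record clock64() and the
// SM id at block start, then burn a per-block pseudo-random amount of time
// with a multiply-with-carry generator.
__global__ void timingKernel(long long *startClock, unsigned int *smIds,
                             unsigned int *sink)
{
    extern __shared__ char pad[];   // dynamic shared memory, requested at launch
    pad[0] = 0;                     // only there to limit blocks per SM

    unsigned int sm;
    asm("mov.u32 %0, %%smid;" : "=r"(sm));

    startClock[blockIdx.x] = clock64();
    smIds[blockIdx.x] = sm;

    // Multiply-with-carry busy loop; the iteration count depends on blockIdx,
    // so each block runs for a different (pseudo-random) length of time.
    unsigned int z = 362436069u ^ blockIdx.x;
    unsigned int w = 521288629u + blockIdx.x;
    unsigned int iters = 100000u + ((z * 2654435761u) >> 16);
    unsigned int x = 0;
    for (unsigned int i = 0; i < iters; ++i) {
        z = 36969u * (z & 65535u) + (z >> 16);
        w = 18000u * (w & 65535u) + (w >> 16);
        x ^= (z << 16) + w;
    }
    sink[blockIdx.x] = x;           // prevent the loop from being optimized away
}

int main(void)
{
    const int numBlocks = 200;
    long long *startClock;
    unsigned int *smIds, *sink;
    cudaMallocManaged(&startClock, numBlocks * sizeof(long long));
    cudaMallocManaged(&smIds, numBlocks * sizeof(unsigned int));
    cudaMallocManaged(&sink, numBlocks * sizeof(unsigned int));

    // 48 KB of dynamic shared memory per block so that only two blocks
    // can be resident on an SM at a time.
    timingKernel<<<numBlocks, 1, 48 * 1024>>>(startClock, smIds, sink);
    cudaDeviceSynchronize();

    for (int b = 0; b < numBlocks; ++b)
        printf("block %3d started on SM %2u at clock %lld\n",
               b, smIds[b], startClock[b]);

    cudaFree(startClock); cudaFree(smIds); cudaFree(sink);
    return 0;
}

Recording the SM id alongside each start clock is what makes the per-GPC clock correction described next possible.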

The GTX 1080 consists of 4 Graphics Processing Clusters (GPCs) with 5 streaming multiprocessors (SMs) each. Each GPC has its own clock, so I used the same method described in the link to determine which SMs belong to which GPCs, and then subtracted a fixed offset to convert all of the clock values to the same time zone.
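That last step is just a per-GPC subtraction on the host; as a sketch (the helper name, smToGpc, and gpcOffset are placeholders for the mapping and offsets measured with the linked method):

// Illustrative helper: shift a raw clock64() sample into a common time base
// by subtracting the offset of the GPC that its SM belongs to.
long long alignedClock(long long rawClock, unsigned int smId,
                       const int *smToGpc, const long long *gpcOffset)
{
    return rawClock - gpcOffset[smToGpc[smId]];
}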

Results

For a 1-D block grid, I found that the blocks were indeed launched in consecutive order:

[Figure: Block start time for a 1-D block grid]

We have 40 blocks starting immediately (2 blocks per SM * 20 SMs) and the subsequent blocks start when the previous blocks end.

For 2-D grids, I found the same linear-sequential order, with blockIdx.x being the fast dimension and blockIdx.y the slow dimension:

[Figure: Block start time for a 2-D block grid]

NB: I made a terrible typo when labeling these plots. All instances of "threadIdx" should be replaced with "blockIdx".

And for a 3-D block grid:

[Figure: Block start time for a 3-D block grid]

Conclusions

For a 1-D grid, these results match what Dr. Pai reported in the linked writeup. For 2-D grids, however, I did not find any evidence for a space-filling curve in block execution order, so this may have changed somewhere between Fermi and Pascal.

And of course, the usual caveats with benchmarking apply, and there's no guarantee that this isn't specific to a particular processor model.

Appendix

For reference, here's a plot showing the results for random vs. fixed runtimes:

[Figure: 1-D grid with start and stop times]

The fact that we see this trend with randomized runtimes gives me more confidence that this is a real result and not just a quirk of the benchmarking task.

KQS
  • This answer is a lot more comprehensive than mine, which I typed two hours ago but then didn't get to submit until after dinner with the colleagues - upvoted. – tera Oct 10 '17 at 19:21
  • all that effort to try to pin down something that could change tomorrow. Or you could take the hint given by @Shadow and simply **guarantee** block execution order, by substituting your own block index for the one handed you by the machine. – Robert Crovella Oct 11 '17 at 03:59
  • @RobertCrovella To be honest, I didn't understand what Shadow was trying to say until your comment made me read it again in a different light. If you post an atomic block counter as an answer I'll accept it. – KQS Oct 11 '17 at 18:06

Yes, there definitely is a correlation (although of course it is not guaranteed).

You are probably best off just trying it out on your device. You can use the %globaltimer and %smid special PTX registers with a bit of inline assembly:

#include <stdio.h>

// Start time of the first block to launch; 0 until the first block sets it.
__managed__ unsigned long long starttime;

// Read the global nanosecond timer via the %globaltimer PTX special register.
__device__ unsigned long long globaltime(void)
{
    unsigned long long time;
    asm("mov.u64  %0, %%globaltimer;" : "=l"(time));
    return time;
}

// Read the ID of the SM this thread is currently running on.
__device__ unsigned int smid(void)
{
    unsigned int sm;
    asm("mov.u32  %0, %%smid;" : "=r"(sm));
    return sm;
}

__global__ void logkernel(void)
{
    unsigned long long t = globaltime();
    // The first block to arrive stores its start time; every other block reads
    // it back, so all times are reported relative to that first block.
    unsigned long long t0 = atomicCAS(&starttime, 0ull, t);
    if (t0==0) t0 = t;
    printf("Started block %2u on SM %2u at %llu.\n", blockIdx.x, smid(), t - t0);
}


int main(void)
{
    starttime = 0;
    // 49152 bytes of dynamic shared memory per block, to limit how many
    // blocks can be resident per SM (see the note below).
    logkernel<<<30, 1, 49152>>>();
    cudaDeviceSynchronize();

    return 0;
}

I've used 48K of shared memory to make the results a bit more interesting - you should substitute your kernel of interest and its actual launch configuration instead.

If I run this code on my laptop with a GTX 1050, I get output like the following:

Started block  1 on SM  1 at 0.
Started block  6 on SM  1 at 0.
Started block  8 on SM  3 at 0.
Started block  0 on SM  0 at 0.
Started block  3 on SM  3 at 0.
Started block  5 on SM  0 at 0.
Started block  2 on SM  2 at 0.
Started block  7 on SM  2 at 0.
Started block  4 on SM  4 at 0.
Started block  9 on SM  4 at 0.
Started block 10 on SM  3 at 152576.
Started block 11 on SM  3 at 152576.
Started block 18 on SM  1 at 153600.
Started block 16 on SM  1 at 153600.
Started block 17 on SM  0 at 153600.
Started block 14 on SM  0 at 153600.
Started block 13 on SM  2 at 153600.
Started block 12 on SM  2 at 153600.
Started block 19 on SM  4 at 153600.
Started block 15 on SM  4 at 153600.
Started block 20 on SM  0 at 210944.
Started block 21 on SM  3 at 210944.
Started block 22 on SM  0 at 211968.
Started block 23 on SM  3 at 211968.
Started block 24 on SM  1 at 214016.
Started block 26 on SM  1 at 215040.
Started block 25 on SM  2 at 215040.
Started block 27 on SM  2 at 215040.
Started block 28 on SM  4 at 216064.
Started block 29 on SM  4 at 217088.

So you see there is indeed a strong correlation.

tera
  • Your code is a lot more elegant than mine, and I'm glad we got the same results. And now I learned that you can `printf` from inside a kernel! Thanks for posting code. – KQS Oct 10 '17 at 20:30