
Edit: this question is a re-done version of the original, so the first several responses may no longer be relevant.

I'm curious about the impact that a forced no-inline device function call has on synchronization inside the called function. I have a simple test kernel that illustrates the behavior in question.

The kernel takes a buffer and passes it to a device function, along with a shared buffer and an indicator variable which identifies a single thread as the "boss" thread. The device function has divergent code: the boss thread first spends time doing trivial operations on the shared buffer, then writes to the global buffer. After a synchronization call, all threads write to the global buffer. After the kernel call, the host prints the contents of the global buffer. Here is the code:

CUDA CODE:

test_main.cu

#include <cstdio>          // printf
#include <cutil_inline.h>  // cutilSafeCall(), from the CUDA SDK
#include "test_kernel.cu"

int main()
{
  int scratchBufferLength = 100;
  int *scratchBuffer;
  int *d_scratchBuffer;

  int b = 1;
  int t = 64;

  // copy scratch buffer to device
  scratchBuffer = (int *)calloc(scratchBufferLength,sizeof(int));
  cutilSafeCall( cudaMalloc(&d_scratchBuffer,
        sizeof(int) * scratchBufferLength) );
  cutilSafeCall( cudaMemcpy(d_scratchBuffer, scratchBuffer,
        sizeof(int)*scratchBufferLength, cudaMemcpyHostToDevice) );

  // kernel call
  testKernel<<<b, t>>>(d_scratchBuffer);

  cudaThreadSynchronize();

  // copy data back to host
  cutilSafeCall( cudaMemcpy(scratchBuffer, d_scratchBuffer,
        sizeof(int) * scratchBufferLength, cudaMemcpyDeviceToHost) );

  // print results
  printf("Scratch buffer contents: \t");
  for(int i=0; i < scratchBufferLength; ++i)
  {
    if(i % 25 == 0)
      printf("\n");
    printf("%d ", scratchBuffer[i]);
  }
  printf("\n");

  //cleanup
  cudaFree(d_scratchBuffer);
  free(scratchBuffer);

  return 0;
}

test_kernel.cu

#ifndef __TEST_KERNEL_CU
#define __TEST_KERNEL_CU


#define IS_BOSS() (threadIdx.x == blockDim.x - 1)

__device__
__noinline__
void testFunc(int *sA, int *scratchBuffer, bool isBoss) {

  if(isBoss)  {   // produces unexpected output-- "broken" code
//if(IS_BOSS())  {    // produces expected output-- "working" code

    for (int c = 0; c < 10000; c++)  {
      sA[0] = 1;
    }
  }

  if(isBoss) {
    scratchBuffer[0] = 1;
  }

  __syncthreads();

  scratchBuffer[threadIdx.x] = threadIdx.x;

  return;

}

__global__
void testKernel(int *scratchBuffer)
{
  __shared__ int sA[4];

  bool isBoss = IS_BOSS();

  testFunc(sA, scratchBuffer, isBoss);
  return;
}
#endif

I compiled this code from within the CUDA SDK to take advantage of the cutilSafeCall() functions in test_main.cu, but of course these could be taken out if you'd like to compile outside the SDK. I compiled with CUDA driver/toolkit version 4.0, targeting compute capability 2.0, and the code was run on a GeForce GTX 480, which has the Fermi architecture.
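If you want to build outside the SDK (after removing the cutilSafeCall() wrappers and the cutil_inline.h include), a command along these lines should work; the output name is arbitrary, and apart from -arch=sm_20 (which matches the compute capability above) no other flags are assumed:

nvcc -arch=sm_20 -o test_noinline test_main.cu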

The expected output is

0 1 2 3 ... blockDim.x-1

However, the output I get is

1 1 2 3 ... blockDim.x-1

This seems to indicate that the boss thread executed the conditional "scratchBuffer[0] = 1;" statement AFTER all threads executed the "scratchBuffer[threadIdx.x] = threadIdx.x;" statement, even though the two are separated by a __syncthreads() barrier.

This occurs even if the boss thread is instructed to write a sentinel value into the buffer slot of a thread in its own warp; the sentinel is the final value present in that slot, rather than the appropriate threadIdx.x.
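A minimal sketch of that sentinel experiment inside testFunc() (the slot index blockDim.x - 2 and the sentinel value 999 are illustrative stand-ins, not the exact values from my test):

  if(isBoss) {
    // write a sentinel into a slot owned by another thread in the boss's own warp
    scratchBuffer[blockDim.x - 2] = 999;
  }

  __syncthreads();

  // after the barrier every thread writes its own index, yet the observed
  // final value in that slot is still the sentinel
  scratchBuffer[threadIdx.x] = threadIdx.x;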

One modification that causes the code to produce the expected output is to change the conditional statement

if(isBoss) {

to

if(IS_BOSS()) {

i.e., to change the divergence-controlling variable from one stored in a parameter register to one computed in a macro. (Note the comments on the relevant lines in the source code.) This is the particular change I've been focusing on to try to track down the problem. Comparing the disassembled .cubins of the kernel with the isBoss conditional (the "broken" code) and the IS_BOSS() conditional (the "working" code), the most conspicuous difference is the absence of an SSY instruction in the broken code.

Here are the disassembled kernels, generated with "cuobjdump -sass test_kernel.cubin". Everything up to the first EXIT is the kernel, and everything after it is the device function. The only differences are in the device function.

DISASSEMBLED OBJECT CODE:

"broken" code

code for sm_20

    Function : _Z10testKernelPi
/*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
/*0008*/     /*0x20009de428004000*/     MOV R2, c [0x0] [0x8];
/*0010*/     /*0x84001c042c000000*/     S2R R0, SR_Tid_X;
/*0018*/     /*0xfc015de428000000*/     MOV R5, RZ;
/*0020*/     /*0x00011de428004000*/     MOV R4, c [0x0] [0x0];
/*0028*/     /*0xfc209c034800ffff*/     IADD R2, R2, 0xfffff;
/*0030*/     /*0x9001dde428004000*/     MOV R7, c [0x0] [0x24];
/*0038*/     /*0x80019de428004000*/     MOV R6, c [0x0] [0x20];
/*0040*/     /*0x08001c03110e0000*/     ISET.EQ.U32.AND R0, R0, R2, pt;
/*0048*/     /*0x01221f841c000000*/     I2I.S32.S32 R8, -R0;
/*0050*/     /*0x2001000750000000*/     CAL 0x60;
/*0058*/     /*0x00001de780000000*/     EXIT;
/*0060*/     /*0x20201e841c000000*/     I2I.S32.S8 R0, R8;
/*0068*/     /*0xfc01dc231a8e0000*/     ISETP.NE.AND P0, pt, R0, RZ, pt;
/*0070*/     /*0xc00021e740000000*/     @!P0 BRA 0xa8;
/*0078*/     /*0xfc001de428000000*/     MOV R0, RZ;
/*0080*/     /*0x04001c034800c000*/     IADD R0, R0, 0x1;
/*0088*/     /*0x04009de218000000*/     MOV32I R2, 0x1;
/*0090*/     /*0x4003dc231a8ec09c*/     ISETP.NE.AND P1, pt, R0, 0x2710, pt;
/*0098*/     /*0x00409c8594000000*/     ST.E [R4], R2;
/*00a0*/     /*0x600005e74003ffff*/     @P1 BRA 0x80;
/*00a8*/     /*0x040001e218000000*/     @P0 MOV32I R0, 0x1;
/*00b0*/     /*0x0060008594000000*/     @P0 ST.E [R6], R0;
/*00b8*/     /*0xffffdc0450ee0000*/     BAR.RED.POPC RZ, RZ;
/*00c0*/     /*0x84001c042c000000*/     S2R R0, SR_Tid_X;
/*00c8*/     /*0x10011c03200dc000*/     IMAD.U32.U32 R4.CC, R0, 0x4, R6;
/*00d0*/     /*0x10009c435000c000*/     IMUL.U32.U32.HI R2, R0, 0x4;
/*00d8*/     /*0x08715c4348000000*/     IADD.X R5, R7, R2;
/*00e0*/     /*0x00401c8594000000*/     ST.E [R4], R0;
/*00e8*/     /*0x00001de790000000*/     RET;
    .................................

"working" code

code for sm_20

    Function : _Z10testKernelPi
/*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];
/*0008*/     /*0x20009de428004000*/     MOV R2, c [0x0] [0x8];
/*0010*/     /*0x84001c042c000000*/     S2R R0, SR_Tid_X;
/*0018*/     /*0xfc015de428000000*/     MOV R5, RZ;
/*0020*/     /*0x00011de428004000*/     MOV R4, c [0x0] [0x0];
/*0028*/     /*0xfc209c034800ffff*/     IADD R2, R2, 0xfffff;
/*0030*/     /*0x9001dde428004000*/     MOV R7, c [0x0] [0x24];
/*0038*/     /*0x80019de428004000*/     MOV R6, c [0x0] [0x20];
/*0040*/     /*0x08001c03110e0000*/     ISET.EQ.U32.AND R0, R0, R2, pt;
/*0048*/     /*0x01221f841c000000*/     I2I.S32.S32 R8, -R0;
/*0050*/     /*0x2001000750000000*/     CAL 0x60;
/*0058*/     /*0x00001de780000000*/     EXIT;
/*0060*/     /*0x20009de428004000*/     MOV R2, c [0x0] [0x8];
/*0068*/     /*0x8400dc042c000000*/     S2R R3, SR_Tid_X;
/*0070*/     /*0x20201e841c000000*/     I2I.S32.S8 R0, R8;
/*0078*/     /*0x4000000760000001*/     SSY 0xd0;
/*0080*/     /*0xfc209c034800ffff*/     IADD R2, R2, 0xfffff;
/*0088*/     /*0x0831dc031a8e0000*/     ISETP.NE.U32.AND P0, pt, R3, R2, pt;
/*0090*/     /*0xc00001e740000000*/     @P0 BRA 0xc8;
/*0098*/     /*0xfc009de428000000*/     MOV R2, RZ;
/*00a0*/     /*0x04209c034800c000*/     IADD R2, R2, 0x1;
/*00a8*/     /*0x04021de218000000*/     MOV32I R8, 0x1;
/*00b0*/     /*0x4021dc231a8ec09c*/     ISETP.NE.AND P0, pt, R2, 0x2710, pt;
/*00b8*/     /*0x00421c8594000000*/     ST.E [R4], R8;
/*00c0*/     /*0x600001e74003ffff*/     @P0 BRA 0xa0;
/*00c8*/     /*0xfc01dc33190e0000*/     ISETP.EQ.AND.S P0, pt, R0, RZ, pt;
/*00d0*/     /*0x040021e218000000*/     @!P0 MOV32I R0, 0x1;
/*00d8*/     /*0x0060208594000000*/     @!P0 ST.E [R6], R0;
/*00e0*/     /*0xffffdc0450ee0000*/     BAR.RED.POPC RZ, RZ;
/*00e8*/     /*0x10311c03200dc000*/     IMAD.U32.U32 R4.CC, R3, 0x4, R6;
/*00f0*/     /*0x10309c435000c000*/     IMUL.U32.U32.HI R2, R3, 0x4;
/*00f8*/     /*0x84001c042c000000*/     S2R R0, SR_Tid_X;
/*0100*/     /*0x08715c4348000000*/     IADD.X R5, R7, R2;
/*0108*/     /*0x00401c8594000000*/     ST.E [R4], R0;
/*0110*/     /*0x00001de790000000*/     RET;
    .................................

The "SSY" instruction is present in the working code but not the broken code. The cuobjdump manual describes the instruction with, "Set synchronization point; used before potentially divergent instructions." This makes me think that for some reason the compiler does not recognize the possibility of divergence in the broken code.

I also found that if I comment out the __noinline__ directive, the code produces the expected output, and the assembly produced by the otherwise "broken" and "working" versions is exactly identical. So this makes me think that when a variable is passed via the call stack, that variable cannot be used to control divergence followed by a synchronization call; the compiler does not seem to recognize the possibility of divergence in that case, and therefore doesn't insert an SSY instruction. Does anyone know whether this is indeed a legitimate limitation of CUDA, and if so, whether it is documented anywhere?
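For completeness, the workaround that falls out of this observation is simply to recompute the predicate inside the __noinline__ callee instead of passing it through a parameter. A sketch (this converts both conditionals, not just the first one as in my test, so treat it as illustrative rather than verified):

__device__
__noinline__
void testFunc(int *sA, int *scratchBuffer) {

  // The divergence-controlling predicate is computed locally via the macro
  // rather than arriving through the call stack, mirroring the IS_BOSS()
  // variant that produced the expected output.
  if (IS_BOSS()) {
    for (int c = 0; c < 10000; c++) {
      sA[0] = 1;
    }
  }

  if (IS_BOSS()) {
    scratchBuffer[0] = 1;
  }

  __syncthreads();

  scratchBuffer[threadIdx.x] = threadIdx.x;
}

The call site in testKernel() then becomes testFunc(sA, scratchBuffer); and the isBoss local can be dropped.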

Thanks in advance.

  • SSY is Set Sync Relative Address (note this is a hardware instruction, not PTX, so the cuobjdump guide is what you need to look at). – talonmies Sep 11 '12 at 20:31
  • SSY is not a Fermi-specific instruction, best I know. Of what nature are the "synchronization issues"? Hangs? Incorrect results? Based on my experience I would say the most likely causes of synchronization issues in CUDA code are: (1) use of __syncthreads() in a divergent code flow, (2) use of warp-synchronous programming. Without runnable source code it is not possible to provide a better diagnosis. – njuffa Sep 11 '12 at 20:46
  • Thanks for the quick responses. The kernel is just a toy example that exposes the synchronization issue I was having with a more complicated kernel; I'll post the relevant code as an edit to my original post. I agree with the general diagnosis of CUDA sync issues; in this case, some threads seem to be breaking through a __syncthreads() barrier. The only time I've ever had that happen before was when a __syncthreads() call existed elsewhere in the code that was only reached by some threads and not others. In this case, there is only one __syncthreads() call, so that's probably not the issue. – user1663964 Sep 11 '12 at 21:00
  • I can't see anything wrong, but the code is not complete. Can you post an as-simple-as-possible, yet complete (i.e. I can just copy, paste, compile and run) example code? I suggest simplifying your example as much as possible first (try it without the objects and member functions, make the main function minimal, etc.) – harrism Sep 11 '12 at 22:13
  • The code you have shown makes no sense. `memberFunc` has three arguments, yet the kernel call only passes two. I am *almost* willing to bet this is a shared memory/compiler optimization conflict that would be fixed by declaring sA volatile. But if you are going to ask a question about why code works or doesn't work, post an *actual* compilable repro case. Most of what you have now is simply not at all helpful. – talonmies Sep 11 '12 at 23:30
  • Looking at the posted SASS, it does not look to me like the two variants are equivalent. In particular, in one case determination of the "boss index" compares the thread index against 0x3f, in the other ("broken") case the code compares the thread index with 0 to determine the boss index. A compiler issue of some kind cannot be excluded, but it would be premature to assume one without a compilable, runnable repro case in hand. BTW, which CUDA version is being used? Using "volatile" may simply mask whatever issue may exist (whether that be in the user code or in the compiler). – njuffa Sep 12 '12 at 00:18
  • @user1663964: Don't post duplicate questions - fix this one instead. The new one is very likely to get closed and deleted. – talonmies Sep 19 '12 at 05:58
  • @talonmies: Got it. Deleted the duplicate, fixed this one. – user1663964 Sep 19 '12 at 06:47
  • I can't reproduce your results on my Fermi-based laptop. I get 0 1 2 3 ... blockDim.x - 1. I tried with nvcc/cudart 4.1 and 5.0, both running on a driver that supports 5.0. You are running on CUDA 4.0, which is 3 going on 4 releases old. Can you try to run with CUDA 4.2 or 5.0RC? Note that when I compile with default options I get a warning about __noinline__ being disabled because of the pointer parameters. With -arch=sm_20 I don't get this warning. In either case, I get correct results. Please also provide the compiler command line you are using. – harrism Sep 25 '12 at 06:38
  • @harrism: I'll split my response into multiple comments due to the character limit. (1) Here's the nvcc command to generate the executable: /usr/local/cuda/bin/nvcc -gencode=arch=compute_20,code=\"sm_20,compute_20\" -gencode=arch=compute_20,code=\"sm_20,compute_20\" -o obj/x86_64/release/test_main.cu_20.o -c test_main.cu -m64 --compiler-options -fno-strict-aliasing -maxrregcount 63 -I/home/mercury/coles/opt/cudpp_src_2.0/include/ -I/home/mercury/coles/opt/cudpp_src_2.0/src/cudpp/ -I. -I/usr/local/cuda/include -I../../common/inc -I../../../shared//inc -DUNIX -O2 – user1663964 Sep 26 '12 at 22:16
  • (2) Unfortunately I don't have the ability to try my code with CUDA 4.2+ on the GTX480; I'm a grad student and I don't have the necessary permissions to install updated drivers on our lab's shared machine with the GPU. I do have access to a GTX680 card running CUDA 4.2, though. I ran the app on that card and found it to have the same behavior you did, compiler warnings and all. Since the 680 uses the Kepler architecture rather than the Fermi architecture, though, I don't think we can be sure they implement synchronization the same way, and that the behavior would manifest itself on both cards. – user1663964 Sep 26 '12 at 22:20
  • (3) Have you tried adding more delay to the first loop in testFunc() to keep the boss thread busy there for longer? I found that I didn't get unexpected behavior with fewer than 10k iterations of that loop, and your laptop GPU might be faster than the one I'm using or something. I also tried adding more delay for the boss thread when running on the GTX680, but no luck so far-- only generating expected output. – user1663964 Sep 26 '12 at 22:20
  • (4) That compiler warning is very interesting. Perhaps pointer parameters are intended to be disabled in all cases, but somehow my particular compiler options allowed it to slip by without tripping the warning? – user1663964 Sep 26 '12 at 22:21
  • No, it's a sm_20 capability. Besides, you could try hand-inlining your `__device__` function to see if that makes a difference. Since your lab's shared machine is so out of date, maybe you can convince someone to update it for you. – harrism Sep 26 '12 at 23:49
  • @harrism: Okay, I did get an update to CUDA 4.2 on the Fermi card, and I'm also unable to reproduce the unexpected behavior. The compiler appears to handle the synchronization in a different way, since there's no "SSY" instruction in either version of the disassembled code anymore. So it seems that either the CUDA team decided to change their model and allow parameters to control divergence/synchronization, or it was a compiler glitch that got fixed, or other changes to the compiler fixed this particular issue as a side effect. Thanks for your help. – user1663964 Sep 29 '12 at 20:45

1 Answer


This appears to have simply been a compiler bug that was fixed in CUDA 4.1/4.2; it does not reproduce for the asker on CUDA 4.2.
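If you want to confirm at run time which CUDA runtime and driver a binary actually sees (for example, to check that the runtime is 4.1 or newer), a minimal sketch using the standard version-query calls:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  int runtimeVersion = 0, driverVersion = 0;
  // Versions are encoded as 1000*major + 10*minor, e.g. 4020 for CUDA 4.2.
  cudaRuntimeGetVersion(&runtimeVersion);
  cudaDriverGetVersion(&driverVersion);
  printf("CUDA runtime %d, driver %d\n", runtimeVersion, driverVersion);
  if (runtimeVersion < 4010)
    printf("Runtime is older than CUDA 4.1; the behavior described in the question may still be present.\n");
  return 0;
}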

– harrism