
I'm trying to understand how to use __threadfence(), as it seems like a powerful synchronization primitive that lets different blocks work together without going through the huge hassle of ending a kernel and starting a new one. The CUDA C Programming guide has an example of it (Appendix B.5), which is fleshed out in the "threadFenceReduction" sample in the SDK, so it seems like something we "should" be using.

However, when I have tried using __threadfence(), it is shockingly slow. See the code below for an example. From what I understand, __threadfence() should just make sure that all pending memory writes from the calling thread are finished before proceeding. Memory latency is somewhat better than a microsecond, I believe, so the total time to deal with the 64KB of memory writes in the included code, on a GTX680, should be somewhere around a microsecond. Instead, the __threadfence() instruction seems to take around 20 microseconds! Instead of using __threadfence() to synchronize, I can end the kernel and launch an entirely new kernel (in the same, default, stream so that it is synchronized) in less than a third of the time!

What is going on here? Does my code have a bug in it that I'm not noticing? Or is __threadfence() really 20x slower than it should be, and 6x slower than an entire kernel launch+cleanup?

Time for 1000 runs of the threadfence kernel: 27.716831 ms
Answer: 120
Time for 1000 runs of just the first 3 lines, including threadfence: 25.962912 ms
Synchronizing without threadfence, by splitting to two kernels: 7.653344 ms
Answer: 120

#include "cuda.h"
#include <cstdio>

__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;
__device__ int scratch[16];
__device__ int junk[16000];
__device__ int answer;

__global__ void usethreadfence() //just like the code example in B.5 of the CUDA C Programming Guide
{
    if (threadIdx.x==0) scratch[blockIdx.x]=blockIdx.x;
    junk[threadIdx.x+blockIdx.x*1000]=17+threadIdx.x; //do some more memory writes to make the kernel nontrivial
    __threadfence();

    if (threadIdx.x==0) {
        unsigned int value = atomicInc(&count, gridDim.x);
        isLastBlockDone = (value == (gridDim.x - 1));
    }
    __syncthreads();
    if (isLastBlockDone && threadIdx.x==0) {
    // The last block sums the results stored in scratch[0 .. gridDim.x-1]
        int sum=0;
        for (int i=0;i<gridDim.x;i++) sum+=scratch[i];
        answer=sum;
    }
}

__global__ void justthreadfence() //first three lines of the previous kernel, so we can compare speeds
{
    if (threadIdx.x==0) scratch[blockIdx.x]=blockIdx.x;
    junk[threadIdx.x+blockIdx.x*1000]=17+threadIdx.x;
    __threadfence();
}

__global__ void usetwokernels_1() //this and the next kernel reproduce the functionality of the first kernel, but faster!
{
    if (threadIdx.x==0) scratch[blockIdx.x]=blockIdx.x;
    junk[threadIdx.x+blockIdx.x*1000]=17+threadIdx.x;
}

__global__ void usetwokernels_2()
{
    if (threadIdx.x==0) {
        int sum=0;
        for (int i=0;i<gridDim.x;i++) sum+=scratch[i];
        answer=sum;
    }
}

int main() {
    int sum;

    cudaEvent_t start, stop; float time; cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start, 0);
    for (int i=0;i<1000;i++) usethreadfence<<<16,1000>>>();
    cudaEventRecord(stop, 0); cudaEventSynchronize(stop); cudaEventElapsedTime(&time, start, stop); printf ("Time for 1000 runs of the threadfence kernel: %f ms\n", time); cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaMemcpyFromSymbol(&sum,answer,sizeof(int)); printf("Answer: %d\n",sum);

    cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start, 0);
    for (int i=0;i<1000;i++) justthreadfence<<<16,1000>>>();
    cudaEventRecord(stop, 0); cudaEventSynchronize(stop); cudaEventElapsedTime(&time, start, stop); printf ("Time for 1000 runs of just the first 3 lines, including threadfence: %f ms\n", time); cudaEventDestroy(start); cudaEventDestroy(stop);

    cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start, 0);
    for (int i=0;i<1000;i++) {usetwokernels_1<<<16,1000>>>(); usetwokernels_2<<<16,1000>>>();}
    cudaEventRecord(stop, 0); cudaEventSynchronize(stop); cudaEventElapsedTime(&time, start, stop); printf ("Synchronizing without threadfence, by splitting to two kernels: %f ms\n", time); cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaMemcpyFromSymbol(&sum,answer,sizeof(int)); printf("Answer: %d\n",sum);
}
  • I would recommend timing each version independently (one per execution). Your kernel execution time may be affected by the time required to create the CUDA context. See this question / answer for CUDA context details (a warm-up sketch along these lines follows these comments): http://stackoverflow.com/questions/13313930/difference-on-creating-a-cuda-context/13314612#13314612 – pQB Jul 22 '13 at 09:51
  • By any chance are you running with CUDA 5.5 ? If so, is it possible for you to retry your test with CUDA 5.0? – Robert Crovella Jul 23 '13 at 03:04
  • I've been using CUDA 5.0. The timing is from putting the code into a new CUDA 5.0 Visual Studio project, with all the default settings, except changing the compilation to compile for 3.0 devices, and putting it in release mode. But in general, I'm fine changing CUDA versions, getting a different CUDA card, etc. I'm just trying to understand how to write high performance code, and have been stumbling over this issue that what should be a lightweight synchronization primitive seems to be orders of magnitude more expensive than one would expect. – poyi Jul 23 '13 at 06:27
  • I haven't tried windows yet. Under Linux with CUDA 5.0, I consistently get results where your `__threadfence()` method is slightly faster (~20%) than your two-kernel method. I've tried this on a few different machines/OS's/GPUs – Robert Crovella Jul 23 '13 at 14:19
  • I don't have a windows machine with a GTX680 conveniently available. But I tried your code on my CUDA 5 Win7 64bit Quadro1000M (cc 2.1) laptop, and I consistently got 12ms for the threadfence method and 16ms for the two-kernel method. Your observation may be very specific to your system or to the GTX680, I'm not sure. – Robert Crovella Jul 24 '13 at 01:08
  • Also, which GPU driver are you using? – Robert Crovella Jul 24 '13 at 01:24
  • Thanks Robert for the pointer to Linux. I've since done timing in Debian and Linux Mint, with a GTX460, as well as the 680-4GB from before. I added two more tests: just the first two lines of the kernel with no threadfence, and rerunning the last test in a non-default stream. Rough conclusions (running out of space here, not sure how I'm meant to do this): cost of __threadfence(); cost of launching in 2 non-default streams (in microseconds): 0.9; 0.2 - 460, Debian. 6.3; 1.9 - 460, Win7. 22.7; 4.0 - 680, Debian/Xfce. 28.7; 4.0 - 680, Debian/KDE. 22.3; 4.9 - 680, Mint/KDE. 22.3; 13 - 680, Win7. – poyi Jul 27 '13 at 00:53
  • The results above (if you can decipher them) are really frustrating. The latency of both "fancy" synchronization features (__threadfence() and running in a non-default stream) is vastly higher on the 680 than on the 460, regardless of operating system (and Windows is generally slower than Linux). Since synchronization is the basic toolbox at the heart of writing any nontrivial parallel code, it's a really awkward situation when a newer card forces you to use an impoverished subset of the old features. Any guidance? A different high-end card I should be using? – poyi Jul 27 '13 at 01:09
  • Also, I should say, the above tests were run across two computers, using the most recent compatible drivers I could install (320 and 326 beta for Windows). Older 300-level drivers for Linux. – poyi Jul 27 '13 at 01:12
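
A minimal warm-up sketch along the lines of pQB's suggestion (not part of the original code; it reuses the kernels and launch configuration from the question) would force context creation before the first timed region, e.g. at the top of main():

// Force CUDA context creation and warm up before any timed region, so the
// first timed kernel launch does not also pay the context-setup cost.
cudaFree(0);                     // harmless runtime call that initializes the CUDA context
usethreadfence<<<16,1000>>>();   // optional warm-up launch of the kernel under test
cudaDeviceSynchronize();         // wait for the warm-up launch to finish before starting any timer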

1 Answer

I have tested your code, compiled with CUDA 6.0, on two different cards, a GT540M (Fermi architecture) and a K20c (Kepler architecture), and these are the results:

GT540M

Time for 1000 runs of the threadfence kernel: 303.373688 ms
Answer: 120
Time for 1000 runs of just the first 3 lines, including threadfence: 300.395416 ms
Synchronizing without threadfence, by splitting to two kernels: 597.729919 ms
Answer: 120

Kepler K20c

Time for 1000 runs of the threadfence kernel: 10.164096 ms
Answer: 120
Time for 1000 runs of just the first 3 lines, including threadfence: 8.808896 ms
Synchronizing without threadfence, by splitting to two kernels: 17.330784 ms
Answer: 120

I do not observe any particularly slow behavior of __threadfence() compared to the other two cases; on both cards, the two-kernel version is actually the slowest.

This can be explained by examining the disassembled code.
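
As an aside, SASS listings like the ones below can be obtained by running cuobjdump on the compiled executable (the name a.out here is just a placeholder):

cuobjdump -sass a.out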

usethreadfence()

c[0xe][0x0] = scratch
c[0xe][0x4] = junk
c[0xe][0x8] = answer
c[0xe][0xc] = count
c[0x0][0x14] = gridDim.x

/*0000*/         MOV R1, c[0x1][0x100];                                     
/*0008*/         S2R R0, SR_TID.X;                                          R0 = threadIdx.x
/*0010*/         ISETP.NE.AND P0, PT, R0, RZ, PT;                           P0 = (R0 != 0)
/*0018*/         S2R R5, SR_CTAID.X;                                        R5 = blockIdx.x
/*0020*/         IMAD R3, R5, 0x3e8, R0;                                    R3 = R5 * 1000 + R0 = threadIdx.x + blockIdx.x * 1000
                                                                        if (threadIdx.x == 0)
/*0028*/    @!P0 ISCADD R2, R5, c[0xe][0x0], 0x2;                               R2 = scratch + blockIdx.x                           
/*0030*/         IADD R4, R0, 0x11;                                             R4 = R0 + 17 = threadIdx.x + 17
/*0038*/         ISCADD R3, R3, c[0xe][0x4], 0x2;                               R3 = junk + threadIdx.x + blockIdx.x * 1000
/*0040*/    @!P0 ST [R2], R5;                                                   scratch[blockIdx.x] = blockIdx.x
/*0048*/         ST [R3], R4;                                                   junk[threadIdx.x + blockIdx.x * 1000] = threadIdx.x + 17
/*0050*/         MEMBAR.GL;                                                     __threadfence
/*0058*/     @P0 BRA.U 0x98;                                                if (threadIdx.x != 0) branch to 0x98
                                                                        if (threadIdx.x == 0)
/*0060*/    @!P0 MOV R2, c[0xe][0xc];                                           R2 = &count
/*0068*/    @!P0 MOV R3, c[0x0][0x14];                                          R3 = gridDim.x
/*0070*/    @!P0 ATOM.INC R2, [R2], R3;                                         R2 = value = old count; count = (count >= gridDim.x) ? 0 : count + 1
/*0078*/    @!P0 IADD R3, R3, -0x1;                                             R3 = R3 - 1 = gridDim.x - 1
/*0080*/    @!P0 ISETP.EQ.AND P1, PT, R2, R3, PT;                               P1 = (R2 == R3) = (value == (gridDim.x - 1))
/*0088*/    @!P0 SEL R2, RZ, 0x1, !P1;                                          if (!P1) R2 = RZ otherwise R2 = 1 (R2 = isLastBlockDone)
/*0090*/    @!P0 STS.U8 [RZ], R2;                                               stores R2 (i.e., isLastBlockDone) to shared memory address [0]
/*0098*/         ISETP.EQ.AND P0, PT, R0, RZ, PT;                           P0 = (R0 == 0) = (threadIdx.x == 0)
/*00a0*/         BAR.RED.POPC RZ, RZ, RZ, PT;                               __syncthreads()
/*00a8*/         LDS.U8 R0, [RZ];                                           R0 = R2 = isLastBlockDone
/*00b0*/         ISETP.NE.AND P0, PT, R0, RZ, P0;                           P0 = (R0 != 0) && P0 = isLastBlockDone && (threadIdx.x == 0)
/*00b8*/    @!P0 EXIT;                                                      if (!(isLastBlockDone && threadIdx.x == 0)) exit
/*00c0*/         ISETP.NE.AND P0, PT, RZ, c[0x0][0x14], PT;                 IMPLEMENTING THE FOR LOOP WITH A LOOP UNROLL OF 4
/*00c8*/         MOV R0, RZ;
/*00d0*/    @!P0 BRA 0x1b8;
/*00d8*/         MOV R2, c[0x0][0x14];
/*00e0*/         ISETP.GT.AND P0, PT, R2, 0x3, PT;
/*00e8*/         MOV R2, RZ;
/*00f0*/    @!P0 BRA 0x170;
/*00f8*/         MOV R3, c[0x0][0x14];
/*0100*/         IADD R7, R3, -0x3;
/*0108*/         NOP;
/*0110*/         ISCADD R3, R2, c[0xe][0x0], 0x2;
/*0118*/         IADD R2, R2, 0x4;
/*0120*/         LD R4, [R3];
/*0128*/         ISETP.LT.U32.AND P0, PT, R2, R7, PT;
/*0130*/         LD R5, [R3+0x4];
/*0138*/         LD R6, [R3+0x8];
/*0140*/         LD R3, [R3+0xc];
/*0148*/         IADD R0, R4, R0;
/*0150*/         IADD R0, R5, R0;
/*0158*/         IADD R0, R6, R0;
/*0160*/         IADD R0, R3, R0;
/*0168*/     @P0 BRA 0x110;
/*0170*/         ISETP.LT.U32.AND P0, PT, R2, c[0x0][0x14], PT;
/*0178*/    @!P0 BRA 0x1b8;
/*0180*/         ISCADD R3, R2, c[0xe][0x0], 0x2;
/*0188*/         IADD R2, R2, 0x1;
/*0190*/         LD R3, [R3];
/*0198*/         ISETP.LT.U32.AND P0, PT, R2, c[0x0][0x14], PT;
/*01a0*/         NOP;
/*01a8*/         IADD R0, R3, R0;
/*01b0*/     @P0 BRA 0x180;
/*01b8*/         MOV R2, c[0xe][0x8];
/*01c0*/         ST [R2], R0;
/*01c8*/         EXIT;

justthreadfence()

    Function : _Z15justthreadfencev
.headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
    /*0000*/         MOV R1, c[0x1][0x100];                 /* 0x2800440400005de4 */
    /*0008*/         S2R R3, SR_TID.X;                      /* 0x2c0000008400dc04 */
    /*0010*/         ISETP.NE.AND P0, PT, R3, RZ, PT;       /* 0x1a8e0000fc31dc23 */
    /*0018*/         S2R R4, SR_CTAID.X;                    /* 0x2c00000094011c04 */
    /*0020*/         IMAD R2, R4, 0x3e8, R3;                /* 0x2006c00fa0409ca3 */
    /*0028*/    @!P0 ISCADD R0, R4, c[0xe][0x0], 0x2;       /* 0x4000780000402043 */
    /*0030*/         IADD R3, R3, 0x11;                     /* 0x4800c0004430dc03 */
    /*0038*/         ISCADD R2, R2, c[0xe][0x4], 0x2;       /* 0x4000780010209c43 */
    /*0040*/    @!P0 ST [R0], R4;                           /* 0x9000000000012085 */
    /*0048*/         ST [R2], R3;                           /* 0x900000000020dc85 */
    /*0050*/         MEMBAR.GL;                             /* 0xe000000000001c25 */
    /*0058*/         EXIT;                                  /* 0x8000000000001de7 */

usetwokernels_1()

    Function : _Z15usetwokernels_1v
.headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
    /*0000*/         MOV R1, c[0x1][0x100];                 /* 0x2800440400005de4 */
    /*0008*/         S2R R0, SR_TID.X;                      /* 0x2c00000084001c04 */
    /*0010*/         ISETP.NE.AND P0, PT, R0, RZ, PT;       /* 0x1a8e0000fc01dc23 */
    /*0018*/         S2R R2, SR_CTAID.X;                    /* 0x2c00000094009c04 */
    /*0020*/         IMAD R4, R2, 0x3e8, R0;                /* 0x2000c00fa0211ca3 */
    /*0028*/    @!P0 ISCADD R3, R2, c[0xe][0x0], 0x2;       /* 0x400078000020e043 */
    /*0030*/         IADD R0, R0, 0x11;                     /* 0x4800c00044001c03 */
    /*0038*/         ISCADD R4, R4, c[0xe][0x4], 0x2;       /* 0x4000780010411c43 */
    /*0040*/    @!P0 ST [R3], R2;                           /* 0x900000000030a085 */
    /*0048*/         ST [R4], R0;                           /* 0x9000000000401c85 */
    /*0050*/         EXIT;                                  /* 0x8000000000001de7 */
    .....................................

usetwokernels_2()

    Function : _Z15usetwokernels_2v
.headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
    /*0000*/         MOV R1, c[0x1][0x100];                          /* 0x2800440400005de4 */
    /*0008*/         S2R R0, SR_TID.X;                               /* 0x2c00000084001c04 */
    /*0010*/         ISETP.NE.AND P0, PT, R0, RZ, PT;                /* 0x1a8e0000fc01dc23 */
    /*0018*/     @P0 EXIT;                                           /* 0x80000000000001e7 */
    /*0020*/         ISETP.NE.AND P0, PT, RZ, c[0x0][0x14], PT;      /* 0x1a8e400053f1dc23 */
    /*0028*/         MOV R0, RZ;                                     /* 0x28000000fc001de4 */
    /*0030*/    @!P0 BRA 0x130;                                      /* 0x40000003e00021e7 */
    /*0038*/         MOV R2, c[0x0][0x14];                           /* 0x2800400050009de4 */
    /*0040*/         ISETP.GT.AND P0, PT, R2, 0x3, PT;               /* 0x1a0ec0000c21dc23 */
    /*0048*/         MOV R2, RZ;                                     /* 0x28000000fc009de4 */
    /*0050*/    @!P0 BRA 0xe0;                                       /* 0x40000002200021e7 */
    /*0058*/         MOV R3, c[0x0][0x14];                           /* 0x280040005000dde4 */
    /*0060*/         IADD R7, R3, -0x3;                              /* 0x4800fffff431dc03 */
    /*0068*/         NOP;                                            /* 0x4000000000001de4 */
    /*0070*/         NOP;                                            /* 0x4000000000001de4 */
    /*0078*/         NOP;                                            /* 0x4000000000001de4 */
    /*0080*/         ISCADD R3, R2, c[0xe][0x0], 0x2;                /* 0x400078000020dc43 */
    /*0088*/         LD R4, [R3];                                    /* 0x8000000000311c85 */
    /*0090*/         IADD R2, R2, 0x4;                               /* 0x4800c00010209c03 */
    /*0098*/         LD R5, [R3+0x4];                                /* 0x8000000010315c85 */
    /*00a0*/         ISETP.LT.U32.AND P0, PT, R2, R7, PT;            /* 0x188e00001c21dc03 */
    /*00a8*/         LD R6, [R3+0x8];                                /* 0x8000000020319c85 */
    /*00b0*/         LD R3, [R3+0xc];                                /* 0x800000003030dc85 */
    /*00b8*/         IADD R0, R4, R0;                                /* 0x4800000000401c03 */
    /*00c0*/         IADD R0, R5, R0;                                /* 0x4800000000501c03 */
    /*00c8*/         IADD R0, R6, R0;                                /* 0x4800000000601c03 */
    /*00d0*/         IADD R0, R3, R0;                                /* 0x4800000000301c03 */
    /*00d8*/     @P0 BRA 0x80;                                       /* 0x4003fffe800001e7 */
    /*00e0*/         ISETP.LT.U32.AND P0, PT, R2, c[0x0][0x14], PT;  /* 0x188e40005021dc03 */
    /*00e8*/    @!P0 BRA 0x130;                                      /* 0x40000001000021e7 */
    /*00f0*/         NOP;                                            /* 0x4000000000001de4 */
    /*00f8*/         NOP;                                            /* 0x4000000000001de4 */
    /*0100*/         ISCADD R3, R2, c[0xe][0x0], 0x2;                /* 0x400078000020dc43 */
    /*0108*/         IADD R2, R2, 0x1;                               /* 0x4800c00004209c03 */
    /*0110*/         LD R3, [R3];                                    /* 0x800000000030dc85 */
    /*0118*/         ISETP.LT.U32.AND P0, PT, R2, c[0x0][0x14], PT;  /* 0x188e40005021dc03 */
    /*0120*/         IADD R0, R3, R0;                                /* 0x4800000000301c03 */
    /*0128*/     @P0 BRA 0x100;                                      /* 0x4003ffff400001e7 */
    /*0130*/         MOV R2, c[0xe][0x8];                            /* 0x2800780020009de4 */
    /*0138*/         ST [R2], R0;                                    /* 0x9000000000201c85 */
    /*0140*/         EXIT;                                           /* 0x8000000000001de7 */
    .....................................

As can be seen, the instructions of justthreadfence() are strictly contained in those of usethreadfence(), while those of usetwokernels_1() and usetwokernels_2() together are practically a partitioning of those of usethreadfence(). So, the difference in timings can be ascribed to the launch overhead of the second kernel.
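
One way to check the launch-overhead explanation (a sketch, not part of the original test; the kernel name emptykernel is made up here) is to time a kernel that does no work at all, using the same event-timing pattern as in the question:

__global__ void emptykernel() {} // does nothing, so its timing reflects launch overhead only

cudaEventCreate(&start); cudaEventCreate(&stop); cudaEventRecord(start, 0);
for (int i=0;i<1000;i++) emptykernel<<<16,1000>>>();
cudaEventRecord(stop, 0); cudaEventSynchronize(stop); cudaEventElapsedTime(&time, start, stop);
printf("Time for 1000 empty kernel launches: %f ms\n", time);
cudaEventDestroy(start); cudaEventDestroy(stop);

If the per-launch time measured this way is comparable to the gap between the two-kernel and the __threadfence() versions, the launch-overhead explanation holds.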
