CUDA C using single precision flop on doubles

Question

The problem

During a project in CUDA C, I came across unexpected behaviour regarding single precision and double precision floating point operations. In the project, I first fill an array with number in a kernel and in another kernel, I do some computation on these numbers. All variables and arrays are double precision, so I would not expect any single precision floating point operation to happen. However, if I analyze the executable of the program using NVPROF, it shows that single precision operations are executed. How is this possible?

Minimal, Complete, and Verifiable example

Here is the smallest program, that shows this behaviour on my architecture: (asserts and error catching has been left out). I use a Nvidia Tesla k40 graphics card.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define Nx 10
#define Ny 10
#define RANDOM double(0.236954587566)

__global__ void test(double *array, size_t pitch){
    double rho, u;
    int x = threadIdx.x + blockDim.x*blockIdx.x;
    int y = threadIdx.y + blockDim.y*blockIdx.y;
    int idx = y*(pitch/sizeof(double)) + 2*x;

    if(x < Nx && y < Ny){
        rho = array[idx]; 
        u = array[idx+1]/rho;
        array[idx] = rho*u;
    }
}

__global__ void fill(double *array, size_t pitch){
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;  
    int idx = y*(pitch/sizeof(double)) + 2*x;

    if(x < Nx || y < Ny){
        array[idx] = RANDOM*idx;
        array[idx + 1] = idx*idx*RANDOM;
    }
}

int main(int argc, char* argv[]) {
    double *d_array;
    size_t pitch;
    cudaMallocPitch((void **) &d_array, &pitch, 2*Nx*sizeof(double), Ny);

    dim3 threadDistribution = dim3(8,8);
    dim3 blockDistribution = dim3( (Nx + threadDistribution.x - 1) / (threadDistribution.x), (Ny + threadDistribution.y) / (threadDistribution.y));
    fill <<< blockDistribution, threadDistribution >>> (d_array, pitch);
    cudaDeviceSynchronize();
    test <<< blockDistribution, threadDistribution >>> (d_array, pitch);

    return 0;
}

The output of NVPROF (edited to make it more readable, if you need the full output, just ask in the comments):

....
Device "Tesla K40c (0)"
Kernel: test(double*, unsigned long)
      Metric Name             Min         Max         Avg
      flop_count_sp           198         198         198
      flop_count_sp_add         0           0           0
      flop_count_sp_mul         0           0           0
      flop_count_sp_fma        99          99          99
      flop_count_sp_special   102         102         102
      flop_count_dp          1214        1214        1214
      flop_count_dp_add         0           0           0
      flop_count_dp_mul       204         204         204
      flop_count_dp_fma       505         505         505

What I've found so far

I found that if I delete the division in line 16:

u = array[idx+1]/rho;
==>
u = array[idx+1];

the output is as expected: zero single precision operations and exactly 100 double precision operations are executed. Does anyone know why the division causes the program to use single precision flop and 10 times more double precision floating point operations? I've also tried using intrinsics (__ddiv_rn), but this didn't solve the problem.

Many thanks in advance!

Edit - Working solution

Altough I still haven't figured out why it uses the single precision, I have found a 'solution' to this problem, thanks to @EOF. Replacing the division by multiplication with the reciprocal of rho did the job:

u = array[idx+1]/rho;
==>
u = array[idx+1]*__drcp_rn(rho);

I'll let someone else answer more authoritatively, but CUDA devices don't have hardware support for floating-point division. Instead, division will invoke a compiler-generated subroutine. It's certainly plausible that said subroutine uses some single-precision math internally to implement the division algorithm. — Jason R, Jan 16 '17 at 15:58
I'd expect the device to use some kind of reciprocal approximation that requires iterative refinement. — EOF, Jan 16 '17 at 15:58
@EOF : Yes, and doing the first few iterations in SP would be entirely sensible. — Martin Bonner supports Monica, Jan 16 '17 at 16:02
Since your code is C and the question is about the C language, why the C++ tag? C and C++ are different languages. For example, in C++ you can have `std::vector` which doesn't exist in C. — Thomas Matthews, Jan 16 '17 at 17:08
@thomas Matthews: the code isn't C and it is compiled with a C++ compiler. And the device math library is implemented as templated overloads of the standard C++ math library. It might not look like C++ to you, but in most certainly isn't C. — talonmies, Jan 16 '17 at 18:11
@talonmies: How is the code C++? The includes `stdio.h`, `stdlib.h`, and `math.h` are C libraries. In C++, system includes don't have extensions. — Thomas Matthews, Jan 16 '17 at 18:17
@thomas Matthews: the code you see goes through parser which emits C++. A C++ compiler compiles the code. The CUDA libraries which are silently imported are C++ template libraries. This code isn't C even if you think it looks like it — talonmies, Jan 16 '17 at 18:26
@ThomasMatthews: I've updated the tags, thank's for the notification. I wanted to write CUDA C code. I compiled using nvcc with gcc specified. Math.h is also available for C, not only c++. — Frank, Jan 16 '17 at 18:46
@EOF. I thought of this as well, but I couldn't find documentation on it. Do you have any idea where I can find more information on this topic? Also, will the output have double precision? — Frank, Jan 16 '17 at 18:51
@Frank Which CUDA version did you use to compile this code? What compiler switches did you pass, in particular, what architecture target did you specify? Using CUDA 7.5 to build the code as posted, with `-arch=sm_35`, then disassembled with `cuobjdump --dump-sass`, I only see two `MUFU.RCP64H` instructions as part of the called double-precision division subroutine, which arguably count as `flop_count_sp_special`. I do not see any instructions that I think would classify as `flop_count_sp_fma`. — njuffa, Jan 16 '17 at 18:56
@Frank: I don't actually know much about gpgpu, but cpus tend to have instructions to find approximate reciprocals (and approximate reciprocal square roots) quickly and with more throughput than the exact divide/square root instructions. A software example is the "fast inverse square root" of Quake fame. The approximations can be improved by something like Newton-Raphson or Taylor-series iterative refinement. You *can* probably get `double`-precision eventually, but it'll take quite a few instructions. — EOF, Jan 16 '17 at 18:57
@njuffa: I used CUDA 8.0 and architecture sm_35. I'm not familiar with the disassembeling and MUFU.RCP64H, but I'll definitely do some research into it. Thanks! — Frank, Jan 16 '17 at 19:02
@EOF Thanks for the info! As you can see in the post, multiplying by the reciprocal does not require any single precision operations. I guess you are right about the way the GPU calculates a division. — Frank, Jan 16 '17 at 19:04
@Frank To disassemble just use `cuobjdump --dump-sass [executable name]`. Single-precision instructions typically start with `F`, e.g. `FFMA`, `FMUL`; not always, e.g. `I2F.F32.*` is integer to single-precision conversion. `MUFU.RCPH64` is a reciprocal approximation instruction that is implemented in the *mu*lti*fu*nction unit. It is possible that the double-precision division subroutine changed in CUDA 8.0 (presumably better optimized). Look for the first `CAL` instruction in the disassembly to find the call to the division; inside that subroutine, there may be another `CAL` to slow-path code — njuffa, Jan 16 '17 at 19:08
@njuffa The MUFU.RCP64H apparently is counted as *Floating Point Operations (Single Precision Special)* in the profiler. There is another single precision FFMA in the code that is apparently used on the high word of the iterative correction as a faster conditional compared to 64 bit maths. — tera, Jan 16 '17 at 19:19
The use of FFMA with one argument being the zero register seems peculiar to me, but with all effects on flags/predicates being undocumented I thought I'd better leave it to you to explain the code. — tera, Jan 16 '17 at 19:24
@tera I suspected something like this might have been introduced as an optimization, but I don't have access to CUDA 8.0 at the moment. I would suggest someone post an answer showing the relevant disassembly of the double-precision division routine from CUDA 8.0, highlighting the single-precision floating-point instructions. The question is not "why" the instructions are there, only where they are coming from. Answer: From inside the double-precision division code. — njuffa, Jan 16 '17 at 19:25

tera · Accepted Answer · 2017-01-17T01:15:35.130

As others have pointed out, CUDA devices do not have instructions for floating point division in hardware. Instead they start from an initial approximation to the reciprocal of the denominator, provided by a single precision special function unit. It's product with the numerator is then iteratively refined until it matches the fraction to within machine precision.

Even the __ddiv_rn() intrinsic is compiled to this instruction sequence by ptxas, so it's use makes no difference.

You can gain closer insight by inspecting the code yourself using cuobjdump -sass, although this is made difficult by no official documentation for shader assembly being available other than the bare list of instructions.

I'll use the following bare-bones division kernel as an example:

__global__ void div(double x, double y, double *z) {
    *z = x / y;
}

This is compiled to the following shader assembly for a compute capability 3.5 device:

    Function : _Z3divddPd
.headerflags    @"EF_CUDA_SM35 EF_CUDA_PTX_SM(EF_CUDA_SM35)"
                                                                                          /* 0x08a0109c10801000 */
    /*0008*/                   MOV R1, c[0x0][0x44];                                      /* 0x64c03c00089c0006 */
    /*0010*/                   MOV R0, c[0x0][0x14c];                                     /* 0x64c03c00299c0002 */
    /*0018*/                   MOV32I R2, 0x1;                                            /* 0x74000000009fc00a */
    /*0020*/                   MOV R8, c[0x0][0x148];                                     /* 0x64c03c00291c0022 */
    /*0028*/                   MOV R9, c[0x0][0x14c];                                     /* 0x64c03c00299c0026 */
    /*0030*/                   MUFU.RCP64H R3, R0;                                        /* 0x84000000031c000e */
    /*0038*/                   MOV32I R0, 0x35b7333;                                      /* 0x7401adb9999fc002 */
                                                                                          /* 0x08a080a080a4a4a4 */
    /*0048*/                   DFMA R4, -R8, R2, c[0x2][0x0];                             /* 0x9b880840001c2012 */
    /*0050*/                   DFMA R4, R4, R4, R4;                                       /* 0xdb801000021c1012 */
    /*0058*/                   DFMA R4, R4, R2, R2;                                       /* 0xdb800800011c1012 */
    /*0060*/                   DMUL R6, R4, c[0x0][0x140];                                /* 0x64000000281c101a */
    /*0068*/                   FSETP.GE.AND P0, PT, R0, |c[0x0][0x144]|, PT;              /* 0x5db09c00289c001e */
    /*0070*/                   DFMA R8, -R8, R6, c[0x0][0x140];                           /* 0x9b881800281c2022 */
    /*0078*/                   MOV R2, c[0x0][0x150];                                     /* 0x64c03c002a1c000a */
                                                                                          /* 0x0880acb0a0ac8010 */
    /*0088*/                   MOV R3, c[0x0][0x154];                                     /* 0x64c03c002a9c000e */
    /*0090*/                   DFMA R4, R8, R4, R6;                                       /* 0xdb801800021c2012 */
    /*0098*/               @P0 BRA 0xb8;                                                  /* 0x120000000c00003c */
    /*00a0*/                   FFMA R0, RZ, c[0x0][0x14c], R5;                            /* 0x4c001400299ffc02 */
    /*00a8*/                   FSETP.GT.AND P0, PT, |R0|, c[0x2][0x8], PT;                /* 0x5da01c40011c021e */
    /*00b0*/               @P0 BRA 0xe8;                                                  /* 0x120000001800003c */
    /*00b8*/                   MOV R4, c[0x0][0x140];                                     /* 0x64c03c00281c0012 */
                                                                                          /* 0x08a1b810b8008010 */
    /*00c8*/                   MOV R5, c[0x0][0x144];                                     /* 0x64c03c00289c0016 */
    /*00d0*/                   MOV R7, c[0x0][0x14c];                                     /* 0x64c03c00299c001e */
    /*00d8*/                   MOV R6, c[0x0][0x148];                                     /* 0x64c03c00291c001a */
    /*00e0*/                   CAL 0xf8;                                                  /* 0x1300000008000100 */
    /*00e8*/                   ST.E.64 [R2], R4;                                          /* 0xe5800000001c0810 */
    /*00f0*/                   EXIT;                                                      /* 0x18000000001c003c */
    /*00f8*/                   LOP32I.AND R0, R7, 0x40000000;                             /* 0x20200000001c1c00 */
                                                                                          /* 0x08a08010a010b010 */
    /*0108*/                   MOV32I R15, 0x1ff00000;                                    /* 0x740ff800001fc03e */
    /*0110*/                   ISETP.LT.U32.AND P0, PT, R0, c[0x2][0xc], PT;              /* 0x5b101c40019c001e */
    /*0118*/                   MOV R8, RZ;                                                /* 0xe4c03c007f9c0022 */
    /*0120*/                   SEL R9, R15, c[0x2][0x10], !P0;                            /* 0x65002040021c3c26 */
    /*0128*/                   MOV32I R12, 0x1;                                           /* 0x74000000009fc032 */
    /*0130*/                   DMUL R10, R8, R6;                                          /* 0xe4000000031c202a */
    /*0138*/                   LOP32I.AND R0, R5, 0x7f800000;                             /* 0x203fc000001c1400 */
                                                                                          /* 0x08a0108ca01080a0 */
    /*0148*/                   MUFU.RCP64H R13, R11;                                      /* 0x84000000031c2c36 */
    /*0150*/                   DFMA R16, -R10, R12, c[0x2][0x0];                          /* 0x9b883040001c2842 */
    /*0158*/                   ISETP.LT.U32.AND P0, PT, R0, c[0x2][0x14], PT;             /* 0x5b101c40029c001e */
    /*0160*/                   MOV R14, RZ;                                               /* 0xe4c03c007f9c003a */
    /*0168*/                   DFMA R16, R16, R16, R16;                                   /* 0xdb804000081c4042 */
    /*0170*/                   SEL R15, R15, c[0x2][0x10], !P0;                           /* 0x65002040021c3c3e */
    /*0178*/                   SSY 0x3a0;                                                 /* 0x1480000110000000 */
                                                                                          /* 0x08acb4a4a4a4a480 */
    /*0188*/                   DMUL R14, R14, R4;                                         /* 0xe4000000021c383a */
    /*0190*/                   DFMA R12, R16, R12, R12;                                   /* 0xdb803000061c4032 */
    /*0198*/                   DMUL R16, R14, R12;                                        /* 0xe4000000061c3842 */
    /*01a0*/                   DFMA R10, -R10, R16, R14;                                  /* 0xdb883800081c282a */
    /*01a8*/                   DFMA R10, R10, R12, R16;                                   /* 0xdb804000061c282a */
    /*01b0*/                   DSETP.LEU.AND P0, PT, |R10|, RZ, PT;                       /* 0xdc581c007f9c2a1e */
    /*01b8*/              @!P0 BRA 0x1e0;                                                 /* 0x120000001020003c */
                                                                                          /* 0x088010b010b8acb4 */
    /*01c8*/                   DSETP.EQ.AND P0, PT, R10, RZ, PT;                          /* 0xdc101c007f9c281e */
    /*01d0*/              @!P0 BRA 0x358;                                                 /* 0x12000000c020003c */
    /*01d8*/                   DMUL.S R8, R4, R6;                                         /* 0xe4000000035c1022 */
    /*01e0*/                   ISETP.GT.U32.AND P0, PT, R0, c[0x2][0x18], PT;             /* 0x5b401c40031c001e */
    /*01e8*/                   MOV32I R0, 0x1ff00000;                                     /* 0x740ff800001fc002 */
    /*01f0*/                   MOV R14, RZ;                                               /* 0xe4c03c007f9c003a */
    /*01f8*/                   SEL R15, R0, c[0x2][0x10], !P0;                            /* 0x65002040021c003e */
                                                                                          /* 0x08b4a49c849c849c */
    /*0208*/                   DMUL R12, R10, R8;                                         /* 0xe4000000041c2832 */
    /*0210*/                   DMUL R18, R10, R14;                                        /* 0xe4000000071c284a */
    /*0218*/                   DMUL R10, R12, R14;                                        /* 0xe4000000071c302a */
    /*0220*/                   DMUL R16, R8, R18;                                         /* 0xe4000000091c2042 */
    /*0228*/                   DFMA R8, R10, R6, -R4;                                     /* 0xdb901000031c2822 */
    /*0230*/                   DFMA R12, R16, R6, -R4;                                    /* 0xdb901000031c4032 */
    /*0238*/                   DSETP.GT.AND P0, PT, |R8|, |R12|, PT;                      /* 0xdc209c00061c221e */
                                                                                          /* 0x08b010ac10b010a0 */
    /*0248*/                   SEL R9, R17, R11, P0;                                      /* 0xe5000000059c4426 */
    /*0250*/                   FSETP.GTU.AND P1, PT, |R9|, 1.469367938527859385e-39, PT;  /* 0xb5e01c00801c263d */
    /*0258*/                   MOV R11, R9;                                               /* 0xe4c03c00049c002e */
    /*0260*/                   SEL R8, R16, R10, P0;                                      /* 0xe5000000051c4022 */
    /*0268*/               @P1 NOP.S;                                                     /* 0x8580000000443c02 */
    /*0270*/                   FSETP.LT.AND P0, PT, |R5|, 1.5046327690525280102e-36, PT;  /* 0xb5881c20001c161d */
    /*0278*/                   MOV32I R0, 0x3ff00000;                                     /* 0x741ff800001fc002 */
                                                                                          /* 0x0880a48090108c10 */
    /*0288*/                   MOV R16, RZ;                                               /* 0xe4c03c007f9c0042 */
    /*0290*/                   SEL R17, R0, c[0x2][0x1c], !P0;                            /* 0x65002040039c0046 */
    /*0298*/                   LOP.OR R10, R8, 0x1;                                       /* 0xc2001000009c2029 */
    /*02a0*/                   LOP.AND R8, R8, -0x2;                                      /* 0xca0003ffff1c2021 */
    /*02a8*/                   DMUL R4, R16, R4;                                          /* 0xe4000000021c4012 */
    /*02b0*/                   DMUL R6, R16, R6;                                          /* 0xe4000000031c401a */
    /*02b8*/                   DFMA R14, R10, R6, -R4;                                    /* 0xdb901000031c283a */
                                                                                          /* 0x08b010b010a0b4a4 */
    /*02c8*/                   DFMA R12, R8, R6, -R4;                                     /* 0xdb901000031c2032 */
    /*02d0*/                   DSETP.GT.AND P0, PT, |R12|, |R14|, PT;                     /* 0xdc209c00071c321e */
    /*02d8*/                   SEL R8, R10, R8, P0;                                       /* 0xe5000000041c2822 */
    /*02e0*/                   LOP.AND R0, R8, 0x1;                                       /* 0xc2000000009c2001 */
    /*02e8*/                   IADD R11.CC, R8, -0x1;                                     /* 0xc88403ffff9c202d */
    /*02f0*/                   ISETP.EQ.U32.AND P0, PT, R0, 0x1, PT;                      /* 0xb3201c00009c001d */
    /*02f8*/                   IADD.X R0, R9, -0x1;                                       /* 0xc88043ffff9c2401 */
                                                                                          /* 0x08b4a480a010b010 */
    /*0308*/                   SEL R10, R11, R8, !P0;                                     /* 0xe5002000041c2c2a */
    /*0310*/               @P0 IADD R8.CC, R8, 0x1;                                       /* 0xc084000000802021 */
    /*0318*/                   SEL R11, R0, R9, !P0;                                      /* 0xe5002000049c002e */
    /*0320*/               @P0 IADD.X R9, R9, RZ;                                         /* 0xe08040007f802426 */
    /*0328*/                   DFMA R14, R10, R6, -R4;                                    /* 0xdb901000031c283a */
    /*0330*/                   DFMA R4, R8, R6, -R4;                                      /* 0xdb901000031c2012 */
    /*0338*/                   DSETP.GT.AND P0, PT, |R4|, |R14|, PT;                      /* 0xdc209c00071c121e */
                                                                                          /* 0x08b4acb4a010b810 */
    /*0348*/                   SEL R8, R10, R8, P0;                                       /* 0xe5000000041c2822 */
    /*0350*/                   SEL.S R9, R11, R9, P0;                                     /* 0xe500000004dc2c26 */
    /*0358*/                   MOV R8, RZ;                                                /* 0xe4c03c007f9c0022 */
    /*0360*/                   MUFU.RCP64H R9, R7;                                        /* 0x84000000031c1c26 */
    /*0368*/                   DSETP.GT.AND P0, PT, |R8|, RZ, PT;                         /* 0xdc201c007f9c221e */
    /*0370*/               @P0 BRA.U 0x398;                                               /* 0x120000001000023c */
    /*0378*/              @!P0 DSETP.NEU.AND P1, PT, |R6|, +INF , PT;                     /* 0xb4681fff80201a3d */
                                                                                          /* 0x0800b8a010ac0010 */
    /*0388*/              @!P0 SEL R9, R7, R9, P1;                                        /* 0xe500040004a01c26 */
    /*0390*/              @!P0 SEL R8, R6, RZ, P1;                                        /* 0xe50004007fa01822 */
    /*0398*/                   DMUL.S R8, R8, R4;                                         /* 0xe4000000025c2022 */
    /*03a0*/                   MOV R4, R8;                                                /* 0xe4c03c00041c0012 */
    /*03a8*/                   MOV R5, R9;                                                /* 0xe4c03c00049c0016 */
    /*03b0*/                   RET;                                                       /* 0x19000000001c003c */
    /*03b8*/                   BRA 0x3b8;                                                 /* 0x12007ffffc1c003c */

The MUFU.RCP64H instruction provides the initial approximation of the reciprocal. It operates on the high 32 bits of the denominator (y) and provides the high 32 bits of the double precision approximation, and therefor is counted as a Floating Point Operations (Single Precision Special) by the profiler. There is another single precision FFMA instruction further down apparently used as a high-throughput version of testing a conditional where full precision isn't required.

Presumable the single-precision comparisons `FSETP` are also counted for the profiler's count of single-precision instructions. — njuffa, Jan 16 '17 at 19:44
Yes - the profiler counts 400 single precision floating point *instructions*, but only 198 single precision floating point *operations*. — tera, Jan 16 '17 at 19:50

CUDA C using single precision flop on doubles

The problem

Minimal, Complete, and Verifiable example

What I've found so far

Edit - Working solution

1 Answers1