CUDA 7.5 supports 16-bit floating point (half precision) variables. Can anyone provide sample code demonstrating their use?
-
Use of half float (16-bit floating point) in CUDA! – Roshan Sep 23 '15 at 11:48
-
Why on earth would anyone ever want to do floating point calculations with anything other than a 64-bit double? What use case is good with less precision? – duffymo Sep 23 '15 at 12:14
-
When you require more variables in a program using the same memory. Also, there are cases where the precision of a 2-byte float is enough. Not all cases are the same. – Roshan Sep 23 '15 at 12:29
-
Memory is cheap; buy more. No scientific calculation will thank you for less precision. – duffymo Sep 23 '15 at 12:30
-
@duffymo - 16-bit is fine for many applications (e.g. computer vision). It's also a lot faster. There are also no single GPUs which can access more than 12GB at the moment, so "buy more" is not going to solve all problems. – Jez Sep 23 '15 at 14:21
-
@duffymo Wherever precision is less important than other things: computer graphics, evolutionary algorithms, ... plenty of stuff. Compared to doubles, four times as many halves fit on a cache line or can be fetched in one shot from gmem. Thus, performance gains and bandwidth gains. Really strange to see unilateral comments like this from a 200k+ member. – Ivan Aksamentov - Drop Sep 24 '15 at 03:21
-
Exactly, I use a 12 GB Titan X. That's the best we have. – Roshan Sep 24 '15 at 04:37
-
I made a living working on problems where precision mattered. Not so strange. – duffymo Sep 24 '15 at 09:01
-
@duffymo It's been argued that there are advantages other than speed and caching for using half precision computations. The almost pseudo-randomness of representable floating point numbers can introduce a form of noise when training and working with learning models. This, as well as an argument that machine learning is solving the wrong problem anyway, actually allows greater performance and better results. Also in optimisation, e.g. root finding, we can get better performance by starting with lower precision and, as we head to convergence, increasing the precision we use. – Joel Biffin Apr 07 '19 at 23:53
1 Answer
There are a few things to note up-front:
- Refer to the half-precision intrinsics.
- Note that many of these intrinsics are only supported in device code. However, in recent/current CUDA versions, many/most of the conversion intrinsics are supported in both host and device code. (And, @njuffa has created a set of host-usable conversion functions here) Therefore, even though the code sample below shows conversion in device code, the same types of conversions and intrinsics (half->float, float->half) are usable and supported in host code in the same way.
- Note that devices of compute capability 5.2 and below do not natively support half-precision arithmetic. This means that any arithmetic operations to be performed must be done on some supported type, such as float. Devices of compute capability 5.3 (Tegra TX1, currently) and presumably future devices will support "native" half-precision arithmetic operations, but these are currently exposed through intrinsics such as __hmul. An intrinsic like __hmul will be undefined on devices that do not support native operations.
- You should include cuda_fp16.h in any file where you intend to make use of these types and intrinsics in device code.
- The half2 data type (a vector type) is really the preferred form for condensed/bulk half storage (such as in a vector or matrix), so you may want to use the relevant half2 conversion functions (a short half2 sketch follows this list).
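As a hedged illustration of the half2 point above (my own sketch, not part of the original answer; the kernel name and parameters are illustrative), a kernel could pack pairs of float values into half2, scale both lanes at once with __hmul2 on cc5.3+ devices, and fall back to float arithmetic elsewhere:

#include <cuda_fp16.h>

// Sketch only: scales pairs of floats through the half2 vector type.
__global__ void half2_scale_kernel(const float2 *din, float2 *dout, int dsize, float scf){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < dsize){
    half2 v = __floats2half2_rn(din[idx].x, din[idx].y);    // pack two floats into one half2
#if __CUDA_ARCH__ >= 530
    half2 s = __floats2half2_rn(scf, scf);
    half2 r = __hmul2(v, s);                                // scale both lanes in one operation
#else
    float2 vf = __half22float2(v);                          // no native half math: widen to float
    half2 r = __floats2half2_rn(vf.x*scf, vf.y*scf);
#endif
    dout[idx] = __half22float2(r);                          // unpack back to two floats
  }
}

In a real application you would presumably keep the array itself in half2 form to get the storage and bandwidth benefit; the float2 ends here are just to keep the sketch parallel with the example below.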
With the above points in mind, here is a simple example that takes a set of float quantities, converts them to half quantities, and scales them by a scale factor:
$ cat t924.cu
#include <stdio.h>
#include <cuda_fp16.h>
#define DSIZE 4
#define SCF 0.5f
#define nTPB 256
__global__ void half_scale_kernel(float *din, float *dout, int dsize){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < dsize){
    half scf = __float2half(SCF);                                // scale factor as half
    half kin = __float2half(din[idx]);                           // input value as half
    half kout;
#if __CUDA_ARCH__ >= 530
    kout = __hmul(kin, scf);                                     // native half multiply (cc5.3+)
#else
    kout = __float2half(__half2float(kin)*__half2float(scf));    // emulate via float on older devices
#endif
    dout[idx] = __half2float(kout);                              // convert back to float for output
  }
}

int main(){

  float *hin, *hout, *din, *dout;
  hin  = (float *)malloc(DSIZE*sizeof(float));
  hout = (float *)malloc(DSIZE*sizeof(float));
  for (int i = 0; i < DSIZE; i++) hin[i] = i;
  cudaMalloc(&din,  DSIZE*sizeof(float));
  cudaMalloc(&dout, DSIZE*sizeof(float));
  cudaMemcpy(din, hin, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
  half_scale_kernel<<<(DSIZE+nTPB-1)/nTPB,nTPB>>>(din, dout, DSIZE);
  cudaMemcpy(hout, dout, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
  for (int i = 0; i < DSIZE; i++) printf("%f\n", hout[i]);
  return 0;
}
$ nvcc -o t924 t924.cu
$ cuda-memcheck ./t924
========= CUDA-MEMCHECK
0.000000
0.500000
1.000000
1.500000
========= ERROR SUMMARY: 0 errors
$
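If you were targeting a cc5.3 device (e.g. Tegra TX1), you would presumably compile for that architecture explicitly so the __hmul branch is used, along these lines (not part of the original output):

$ nvcc -arch=sm_53 -o t924 t924.cu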
If you study the above code, you'll note that, except in the case of cc5.3 and higher devices, the arithmetic is being done as a regular float operation. This is consistent with note 3 above.
The takeaways are as follows:
- On devices of cc5.2 and below, the half datatype may still be useful, but principally as a storage optimization (and, relatedly, perhaps a memory bandwidth optimization, since e.g. a given 128-bit vector load could load 8 half quantities at once). For example, if you have a large neural network, and you've determined that the weights can tolerate being stored as half-precision quantities (thereby doubling the storage density, or approximately doubling the size of the neural network that can be represented in the storage space of a GPU), then you could store the neural network weights as half-precision. Then, when you need to perform a forward pass (inference) or a backward pass (training), you could load the weights in from memory, convert them on-the-fly (using the intrinsics) to float quantities, perform the necessary operation (perhaps including adjusting the weight due to training), then (if necessary) store the weight again as a half quantity. (A minimal sketch of this pattern follows after this list.)
- For cc5.3 and future devices, if the algorithm will tolerate it, it may be possible to perform a similar operation as above, but without conversion to float (and perhaps back to half), instead leaving all data in half representation and doing the necessary arithmetic directly (using e.g. the __hmul or __hadd intrinsics).
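As a minimal sketch of the first takeaway (my own illustration; the update rule and names like grad and lr are assumptions, not from the answer), a kernel might keep weights resident in memory as half, widen them to float for the arithmetic, and narrow the result back to half for storage:

#include <cuda_fp16.h>

// Sketch only: weights stored as half, arithmetic done in float.
__global__ void update_half_weights(half *weights, const float *grad, float lr, int n){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < n){
    float w = __half2float(weights[idx]);   // load and widen to float
    w -= lr*grad[idx];                      // do the arithmetic as float
    weights[idx] = __float2half(w);         // narrow back to half for storage
  }
}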
Although I haven't demonstrated it here, the half datatype is "usable" in host code. By that, I mean you can allocate storage for items of that type, and perform e.g. cudaMemcpy operations on it. But the host code doesn't know anything about the half data type (e.g. how to do arithmetic on it, or print it out), and, for example, the arithmetic intrinsics are not usable in host code. Therefore, you could certainly allocate storage for a large array of half (or probably half2) data type if you wanted to (perhaps to store a set of neural network weights), but you could only directly manipulate that data with any ease from device code, not host code.
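A minimal host-side sketch of that point (my own; it assumes a recent CUDA toolkit where __float2half is usable in host code, which was not the case when this answer was written):

#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(){
  const int N = 1024;
  // host code can allocate and move half data, even though it cannot do arithmetic on it
  half *h_w = (half *)malloc(N*sizeof(half));
  for (int i = 0; i < N; i++) h_w[i] = __float2half(0.01f*i);    // host-side conversion
  half *d_w;
  cudaMalloc(&d_w, N*sizeof(half));
  cudaMemcpy(d_w, h_w, N*sizeof(half), cudaMemcpyHostToDevice);  // plain byte copy
  // ... launch kernels that manipulate d_w in device code ...
  cudaFree(d_w);
  free(h_w);
  return 0;
}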
A few more comments:
- The CUBLAS library implements a matrix-matrix multiply designed to work directly on half data. The description above should give some insight as to what is likely going on "under the hood" for different device types (i.e. compute capabilities). (A hedged call sketch follows below.)
- A related question about the use of half in thrust is here.
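For reference, here is a hedged sketch of calling the CUBLAS half-precision GEMM mentioned above (cublasHgemm, which requires a cc5.3+ device; dA, dB, dC are assumed to be device pointers to column-major half matrices that have already been populated, and error checking is omitted):

#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch only: C = alpha*A*B + beta*C, with A (m x k), B (k x n), C (m x n) all in half.
void half_gemm(const half *dA, const half *dB, half *dC, int m, int n, int k){
  cublasHandle_t handle;
  cublasCreate(&handle);
  half alpha = __float2half(1.0f);   // host-side conversion (recent CUDA)
  half beta  = __float2half(0.0f);
  cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
              m, n, k,
              &alpha, dA, m,         // lda = m (column-major, no transpose)
                      dB, k,         // ldb = k
              &beta,  dC, m);        // ldc = m
  cublasDestroy(handle);
}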

-
So in the current context, we cannot use cudaMalloc for a "half" variable, or cudaMemcpy to copy a half variable on the host (created using libraries) to a device half variable, is that right? – Roshan Sep 29 '15 at 02:44
-
I specifically said in my answer that you could use cudaMalloc and cudaMemcpy with the half datatype. – Robert Crovella Sep 29 '15 at 11:28