What is the proper way to print an FP16 element in a cuda kernel without casting it to other formats?

Question

I wrote a simple kernel that uses printf to log some values after computation. It works correctly for types such as int, but not for fp16: the half value is incorrectly printed as zero. The problem seems to be that the format specifier is wrong. Which format should I use, without casting the value to another data type such as float or double?


#include <stdio.h>
#include <cuda_fp16.h>

#define DTYPE half //int
#define PRINT_FMT "output %f\n" //"output %d\n"

__global__ void __launch_bounds__(1024) test_print_kernel(DTYPE *__restrict__ O)
{
    // printf("test kernel\n");
    if (((int)blockIdx.x == 0) && ((int)threadIdx.x == 0))
    {
        O[0] = ((DTYPE)(2));
        __syncthreads();

        printf(PRINT_FMT, O[0]);

    }
}

int main(int argc, char **argv)
{
    DTYPE *h_O;
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    cudaMallocHost(&h_O, 1 * sizeof(DTYPE));
    test_print_kernel<<<dim3(1, 1, 1), dim3(1, 1, 1), 0, (cudaStream_t)stream>>>(h_O);
    cudaDeviceSynchronize();
}

The warning I got during compilation: test_printf.cu(16): warning #1290-D: a class type that is not trivially copyable passed through ellipsis

test_printf.cu(16): warning #181-D: argument is incompatible with corresponding format string conversion

And it prints to the terminal: output 0.000000

There [isn't](https://stackoverflow.com/a/70673073/1695960) a `printf` format specifically for fp16. [Convert](https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____HALF__MISC.html#group__CUDA__MATH____HALF__MISC_1g0b79d92cb1fd7012b9c4416e9f4a03ba) it to `float` before passing it to `printf`. Since `printf` in a kernel is not part of any high-performance code path (it is usually used for debugging), this shouldn't present an onerous problem. - Robert Crovella, Jun 14 '23 at 17:56

score 1 · Accepted Answer · answered Jun 18 '23 at 05:00

There is no output format specifier for half precision floating point values in either the C++ standard definition of printf, or in the CUDA implementation.

Your only real choice is to convert the half value to a float and use the standard float format specifier. The CUDA Math API has an intrinsic function for this, so something like:

printf("output %f\n", __half2float(O[0]));

should work at a minimum. As pointed out in comments, there is a performance penalty for the conversion, but this is printf, which has a huge performance penalty anyway.
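For completeness, here is a minimal sketch of a whole program built around that conversion. The kernel name `print_half_kernel` and the use of plain device memory are illustrative choices, not taken from the question; the stored value stays a half and is only widened to float for the `printf` call.

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Illustrative kernel name; writes one half value and prints it.
__global__ void print_half_kernel(half *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        out[0] = __float2half(2.0f);
        // Widen to float only for printing; out[0] itself remains a half.
        printf("output %f\n", __half2float(out[0]));
    }
}

int main()
{
    half *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(half));
    print_half_kernel<<<1, 1>>>(d_out);
    // Synchronizing also flushes the device-side printf buffer.
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

Since 2.0 is exactly representable in half precision, this should print `output 2.000000`.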

Note that this will make the kernel specific to the half case. If you genuinely need a generic kernel for different types with output, there are variadic template solutions which you could use to generalise this to different types, for example as discussed here.
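One simple way to keep the kernel generic, sketched here with plain function overloading rather than the variadic template approach mentioned above, is to move the type-specific formatting into small device helpers. The helper name `print_value` and the host wrapper `run` are hypothetical names chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Hypothetical helpers: one overload per element type the kernel supports.
__device__ inline void print_value(int v)   { printf("output %d\n", v); }
__device__ inline void print_value(float v) { printf("output %f\n", v); }
// half has no printf specifier, so it is converted to float here only.
__device__ inline void print_value(half v)  { printf("output %f\n", __half2float(v)); }

template <typename T>
__global__ void test_print_kernel(T *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        out[0] = static_cast<T>(2);
        print_value(out[0]);  // overload resolution picks the right format
    }
}

// Hypothetical host wrapper that launches the kernel for one element type.
template <typename T>
void run()
{
    T *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(T));
    test_print_kernel<T><<<1, 1>>>(d_out);
    cudaDeviceSynchronize();  // also flushes device-side printf output
    cudaFree(d_out);
}

int main()
{
    run<int>();
    run<half>();
    return 0;
}
```

Supporting another element type then only requires adding one more `print_value` overload; the kernel body stays unchanged.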
