The question suggests it is asked in the context of machine learning, so the focus here is on single-precision computation, in particular the IEEE-754 binary32 format. The asker states that NVIDIA GPUs are a platform of interest. I will focus on programming these GPUs with CUDA, since I am not familiar with Python bindings for CUDA.
When speaking of FLOPS, there are various schools of thought on how to count them beyond simple additions and multiplications. GPUs, for example, compute divisions and square roots in software. It is less ambiguous to identify floating-point instructions and count those, which is what I will do here. Note that not all floating-point instructions execute with the same throughput, and that throughput can also depend on the GPU architecture. Relevant information about instruction throughputs can be found in the CUDA Programming Guide.
Starting with the Turing architecture (compute capability 7.5), GPUs include an instruction MUFU.TANH to compute a single-precision hyperbolic tangent with about 16 bits of accuracy. The single-precision functions supported by the multi-function unit (MUFU) are generally computed via quadratic interpolation in tables stored in ROM. As best I can tell, MUFU.TANH is exposed at the level of the virtual assembly language PTX, but not (as of CUDA 11.2) as a device function intrinsic. Since the functionality is exposed at the PTX level, we can easily create our own intrinsic with one line of inline assembly:
// Compute hyperbolic tangent for >= sm75. maxulperr = 133.95290, maxrelerr = 1.1126e-5
__forceinline__ __device__ float __tanhf (float a)
{
    asm ("tanh.approx.f32 %0,%1; \n\t" : "=f"(a) : "f"(a));
    return a;
}
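For illustration, here is a minimal sketch of how such an intrinsic might be used from an elementwise kernel; the kernel name and launch configuration are my own and purely illustrative, not part of any library:

// applies the intrinsic elementwise; kernel name and launch configuration are illustrative
__global__ void tanh_activation (const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __tanhf (in[i]);
}
// possible launch: tanh_activation<<<(n + 255) / 256, 256>>>(d_in, d_out, n);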
On older GPU architectures with compute capability < 7.5, we can implement an intrinsic with very closely matching characteristics by algebraic transformation and use of the machine instructions MUFU.EX2 and MUFU.RCP, which compute the exponential base 2 and the reciprocal, respectively. For arguments small in magnitude we can use tanh(x) ≈ x and experimentally determine a good switchover point between the two approximations.
// like copysignf(); for use when the first argument is known to be positive
__forceinline__ __device__ float copysignf_pos (float a, float b)
{
    return __int_as_float (__float_as_int (a) | (__float_as_int (b) & 0x80000000));
}
// Compute hyperbolic tangent for < sm_75. maxulperr = 108.82848, maxrelerr = 9.3450e-6
__forceinline__ __device__ float __tanhf (float a)
{
    const float L2E = 1.442695041f; // log2(e)
    float e, r, s, t, d;
    s = fabsf (a);
    t = -L2E * 2.0f * s;
    asm ("ex2.approx.ftz.f32 %0,%1;\n\t" : "=f"(e) : "f"(t)); // e = exp(-2*|a|)
    d = e + 1.0f;
    asm ("rcp.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(d)); // r = 1/(1 + exp(-2*|a|))
    r = fmaf (e, -r, r); // tanh(|a|) = (1 - exp(-2*|a|)) / (1 + exp(-2*|a|))
    if (s < 4.997253418e-3f) r = a; // tanh(a) ~= a for small arguments; switchover determined experimentally
    if (!isnan (a)) r = copysignf_pos (r, a); // transfer sign of argument; NaN passes through
    return r;
}
Compiling this code with CUDA 11.2 for an sm_70 target and then disassembling the binary with cuobjdump --dump-sass shows eight floating-point instructions. We can also see that the resulting machine code (SASS) is branchless.
If we desire a hyperbolic tangent with full single-precision accuracy, we can use a minimax polynomial approximation for arguments small in magnitude, while using algebraic transformation and the machine instructions MUFU.EX2 and MUFU.RCP for arguments larger in magnitude. Beyond a certain magnitude of the argument, the result is ±1.
// Compute hyperbolic tangent. maxulperr = 1.81484, maxrelerr = 1.9547e-7
__forceinline__ __device__ float my_tanhf (float a)
{
    const float L2E = 1.442695041f; // log2(e)
    float p, s, t, r;
    t = fabsf (a);
    if (t >= 307.0f/512.0f) { // 0.599609375
        // tanh(t) = 1 - 2 / (exp(2*t) + 1)
        r = L2E * 2.0f * t;
        asm ("ex2.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(r)); // exp(2*t)
        r = 1.0f + r;
        asm ("rcp.approx.ftz.f32 %0,%1;\n\t" : "=f"(r) : "f"(r)); // 1/(exp(2*t) + 1)
        r = fmaf (r, -2.0f, 1.0f);
        if (t >= 9.03125f) r = 1.0f; // result is 1.0f in single precision for arguments this large
        r = copysignf_pos (r, a);
    } else {
        // minimax polynomial: tanh(a) ~= a + a * s * p(s), with s = a*a
        s = a * a;
        p = 1.57394409e-2f;               //  0x1.01e000p-6
        p = fmaf (p, s, -5.23025580e-2f); // -0x1.ac766ap-5
        p = fmaf (p, s, 1.33152470e-1f);  //  0x1.10b23ep-3
        p = fmaf (p, s, -3.33327681e-1f); // -0x1.5553dap-2
        p = fmaf (p, s, 0.0f);            // p = p * s
        r = fmaf (p, a, a);
    }
    return r;
}
This code contains a data-dependent branch, and a look at the machine code generated by CUDA 11.2 for an sm_75 target shows that the branch is retained. This means that, in general, some active threads will follow one side of the branch while the remainder follow the other side, requiring subsequent synchronization. So to get a realistic idea of the computational effort required, we need to combine the floating-point instruction counts of both execution paths. This comes out to thirteen floating-point instructions.
The error bounds in the code comments above were established by exhaustive tests with all possible single-precision arguments.
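For illustration, here is a minimal sketch of what such an exhaustive scan might look like, assuming my_tanhf() from above is pasted into the same file. The kernel name, grid configuration, and the use of relative error (rather than the proper ulp-error computation behind the quoted maxulperr figures) are my own simplifications; a scan of this kind, restricted to small arguments, can also be used to tune the switchover point mentioned earlier.

#include <cstdio>
#include <cstring>
#include <cmath>

// running maximum of the relative error; nonnegative doubles order the same
// as their 64-bit integer bit patterns, so 64-bit integer atomicMax suffices
// (requires compute capability >= 3.5)
__device__ unsigned long long max_rel_err_bits = 0ULL;

__global__ void scan_tanh_error (void)
{
    unsigned long long stride = (unsigned long long)gridDim.x * blockDim.x;
    for (unsigned long long i = (unsigned long long)blockIdx.x * blockDim.x + threadIdx.x;
         i < 0x100000000ULL; i += stride) {
        float x = __uint_as_float ((unsigned int)i);   // enumerate every encoding
        if (isnan (x)) continue;                       // NaN encodings carry no error
        double ref = tanh ((double)x);
        double res = (double)my_tanhf (x);
        double err = (ref != 0.0) ? fabs ((res - ref) / ref) : fabs (res - ref);
        atomicMax (&max_rel_err_bits, (unsigned long long)__double_as_longlong (err));
    }
}

int main (void)
{
    scan_tanh_error<<<1024, 256>>>();
    cudaDeviceSynchronize ();
    unsigned long long bits;
    cudaMemcpyFromSymbol (&bits, max_rel_err_bits, sizeof bits);
    double max_rel_err;
    memcpy (&max_rel_err, &bits, sizeof max_rel_err);
    printf ("max. relative error = %.4e\n", max_rel_err);
    return 0;
}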