
NumPy has complex64, corresponding to two float32s.

It also has float16, but no complex32.

How come? I have signal-processing calculations involving FFTs where I think I'd be fine with complex32, but I don't see how to get there. In particular, I was hoping for a speedup on an NVIDIA GPU with CuPy.
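
A quick check of what NumPy does and doesn't expose (just to illustrate the gap; nothing here is specific to my signal-processing code):

```python
import numpy as np

# complex64 really is a pair of float32s: 8 bytes per element.
print(np.dtype(np.complex64).itemsize)   # 8
# float16 exists and takes 2 bytes per element...
print(np.dtype(np.float16).itemsize)     # 2
# ...but there is no matching 4-byte complex dtype.
print(hasattr(np, "complex32"))          # False
```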

However, it seems that float16 is slower on the GPU rather than faster.
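
To make that concrete, a rough way to check this kind of claim would be something like the sketch below. It is a matrix-multiply stand-in with arbitrary sizes, not my actual FFT code (I don't see how to run the FFT itself in half precision), and it assumes CuPy with a CUDA-capable GPU:

```python
import time
import cupy as cp  # assumes CuPy and a CUDA-capable GPU are available

def avg_seconds(dtype, n=4096, repeats=10):
    a = cp.random.random((n, n)).astype(dtype)
    b = cp.random.random((n, n)).astype(dtype)
    cp.matmul(a, b)                  # warm-up: kernel selection, memory pool
    cp.cuda.Device().synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        cp.matmul(a, b)
    cp.cuda.Device().synchronize()   # GPU work is async; sync before stopping the clock
    return (time.perf_counter() - start) / repeats

print("float32:", avg_seconds(cp.float32))
print("float16:", avg_seconds(cp.float16))
```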

Why is half-precision unsupported and/or overlooked?

A related question is why we don't have complex integers, as these might also present an opportunity for a speedup.

Cris Luengo
Lars Ericson
  • Why were you expecting a speedup? – hpaulj Jun 26 '19 at 19:43
  • Because half the bits to push around. – Lars Ericson Jun 27 '19 at 00:28
  • But what if the processor (and `c` code) is optimized for 32- and 64-bit processing? Most of us aren't using 8-bit processors any more! – hpaulj Jun 27 '19 at 00:42
  • You can torture a late-model NVIDIA GPU into doing it, and signal processing can be quite slow, so there may be pain points where it's worth it. https://docs.nvidia.com/cuda/cufft/index.html#half-precision-transforms – Lars Ericson Jun 27 '19 at 01:11
  • There was another recent SO question about float16 - specifically with respect to `pandas`. On a matrix multiplication example I was just testing, float16 was 1 to 2 orders of magnitude slower than float64. – hpaulj Jun 27 '19 at 02:58
  • It might be due to the condition number of the DFT. See https://www2.stetson.edu/~efriedma/research/wymer.pdf: the condition numbers for the 2-norm and infinity norm are sqrt(n) and n, where n is the length of the signal. Consider the DFT of an 8x8 image. Since float16 is only about 3.3 decimal digits precise, the 2-norm of the transformed image would only be about 2 digits precise, and the pixelwise precision of the output is less than 2 digits (see the sketch after these comments). See https://hal.archives-ouvertes.fr/hal-01837982/file/A_Study_on_Convolution_Using_Half_Precision_Floating_Point_Numbers_on_GPU_for_Radio_Astronomy_Deconvolution.pdf – francis Jun 27 '19 at 16:26
  • With respect to what CuPy has or has not implemented, that's probably just a matter of development priority. CuPy is still pretty new (at least compared to CUDA or numpy, for example). You might express your desire to the CuPy developers in the form of an issue or pull request. I doubt asking a random question on SO is a good way to indicate your interest to the CuPy development team. A better way would be to contact them directly (on GitHub, for example) and provide a specific example, and maybe even a specific genre, for motivation. – Robert Crovella Jun 29 '19 at 15:04
  • `However it seems that float16 is slower on GPU rather than faster.` It's certainly possible for an FP16 FFT on a GPU to be faster than a corresponding FP32 (or FP64) FFT. GPU type matters, of course. It also seems like you may have pointed this out in an oblique fashion in your comments, so I'm not sure why you would leave that statement in your question unedited. So I'll just leave this here for future readers. – Robert Crovella Jun 29 '19 at 15:22
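
To put francis's precision point above into code: NumPy has no float16 FFT, so the sketch below only models rounding the *input* to float16 (roughly 3.3 decimal digits), not half-precision arithmetic inside the transform itself:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((8, 8))                      # an arbitrary 8x8 "image"

ref = np.fft.fft2(img)                        # full-precision reference
approx = np.fft.fft2(img.astype(np.float16))  # same transform, input rounded to float16

# Relative 2-norm error introduced just by the half-precision input
print(np.linalg.norm(approx - ref) / np.linalg.norm(ref))
```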

1 Answer


This issue has been raised in the CuPy repo for some time:

https://github.com/cupy/cupy/issues/3370

But there's no concrete work plan yet; most of the work so far is still exploratory.

One of the reasons it's not trivial to work out is that there is no numpy.complex32 dtype that we can directly import (note that all of CuPy's dtypes are just aliases of NumPy's), so there would be problems whenever a device-to-host transfer is requested. The other issue is that there are no native mathematical functions for complex32 on either CPU or GPU, so we would need to write them all ourselves to handle casting, ufuncs, and so on. In the linked issue there is a link to a NumPy discussion, and my impression is that complex32 is not currently being considered there...
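
To illustrate the gap, about the closest one can get today is to pack pairs of float16s by hand and upcast whenever real math is needed. The `complex32` below is just a made-up structured dtype for this sketch, not a real NumPy or CuPy type:

```python
import numpy as np

# "complex32" here is just a made-up structured dtype: two float16 fields,
# 4 bytes per element. It is NOT a real NumPy/CuPy dtype.
complex32 = np.dtype([("re", np.float16), ("im", np.float16)])

x = np.zeros(8, dtype=complex32)          # compact storage / transfer format
x["re"] = np.arange(8, dtype=np.float16)

# No ufuncs, casting rules, or FFTs exist for this dtype, so any real math
# requires manually upcasting to the smallest supported complex type:
xc = x["re"].astype(np.float32) + 1j * x["im"].astype(np.float32)
print(xc.dtype)                           # complex64
X = np.fft.fft(xc)                        # the transform itself runs at >= single precision
```

That gives compact storage and transfers, but every operation still has to go through complex64, which is exactly the missing piece.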

Leo Fang
  • I would like to add, though, that during preliminary testing to support half-precision FFTs in CuPy (https://github.com/cupy/cupy/pull/4407), we do see the expected 2x speedup on certain architectures. @RobertCrovella It would be great if you could help us understand better why Pascal is not performant there. – Leo Fang Dec 07 '20 at 03:41