
I have a C/C++ program that involves intensive 32-bit floating-point matrix math: addition, subtraction, multiplication, division, and so on.

Can I speed up my program by converting 32-bit floating-point numbers into 16-bit fixed-point numbers? How much of a speed gain can I expect?

Currently I'm working on an Intel i5 CPU. I'm using OpenBLAS to perform the matrix calculations. How should I re-implement OpenBLAS functions such as cblas_sgemm to perform fixed-point calculations?

I know that SSE (Streaming SIMD Extensions) operates on 4x32 = 8x16 = 128 bits of data at a time, i.e. four 32-bit floats or eight 16-bit fixed-point values per register. I guess that after converting from 32-bit floating point to 16-bit fixed point, my program would run twice as fast.
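
For illustration, here is roughly the trade-off I have in mind (a hypothetical sketch with SSE2 intrinsics, not my actual code):

```c
#include <emmintrin.h>  // SSE2

// One 128-bit register holds either 4 floats or 8 int16 values,
// so a single packed add touches twice as many 16-bit elements.
__m128  add4_float(__m128 a, __m128 b)   { return _mm_add_ps(a, b); }     // 4 x 32-bit float
__m128i add8_int16(__m128i a, __m128i b) { return _mm_add_epi16(a, b); }  // 8 x 16-bit integer
```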

Shawn
  • Unlikely, especially on Haswell and up with its floating-point FMA instructions, unless you have very specific use cases that can benefit from, e.g., `pmaddubsw` or `PMULHRSW`. – Iwillnotexist Idonotexist Sep 24 '16 at 15:03
  • At the risk of stating the obvious, are you able to access a GPU? If so, you may want to look at https://github.com/xianyi/clOpenBLAS – Ahmed Masud Sep 24 '16 at 15:32
  • @Iwillnotexist, are you saying that 32-bit float and 16-bit fixed point have roughly the same performance on an Intel i5 CPU? – Shawn Sep 24 '16 at 15:35
  • @Ahmed, I don't have access to a GPU in my C/C++ program. Thank you for the pointer anyway. – Shawn Sep 24 '16 at 15:36
  • Hi Shawn, here's an interesting resource I ran into: http://nicolas.limare.net/pro/notes/2014/12/12_arit_speed/; it references http://www.agner.org/optimize/ as well, which may be of use; again, sorry that it's not a direct answer and feels a bit obtuse :) – Ahmed Masud Sep 24 '16 at 15:46
  • Yet another interesting resource https://bitbucket.org/MDukhan/yeppp – Ahmed Masud Sep 24 '16 at 15:50
  • If memory bandwidth is your entire bottleneck, then using 1/2 sized data will make it 2x faster. It is very unlikely that memory bandwidth is the only limiting factor. Using more CPUs is likely to speed up your program more, and will scale up on machines with more processors, and you won't be losing as much precision and range as you would with 16 bit values. You haven't provided enough information to be sure whether your problem can be easily solved in parallel. – doug65536 Sep 24 '16 at 16:20
  • @doug65536, my program will run on a single core of an Intel CPU, and it cannot be parallelized to run on either a GPU or multiple CPU cores. – Shawn Sep 25 '16 at 03:35
  • I can confirm by looking at Agner Fog's tables that on Haswell+, float is the way to go. Haswell can sustain 32 FLOPs/cycle/core using 2 eight-element vector FMAs/cycle/core, but can only sustain 1 pmaddubsw or pmulhrsw + 1 paddw/cycle/core (2 sixteen-element vector operations = 32 16-bit integer ops/cycle/core total). So you get far more precision and much less complexity by just using OpenBLAS. – Iwillnotexist Idonotexist Sep 25 '16 at 03:42
  • @IwillnotexistIdonotexist, thank you very much for your reply. Do you have any suggestions for how I can speed up my floating-point computation? – Shawn Sep 25 '16 at 04:07
  • I can't really do that without you showing us your code as it currently stands, but _in general_ the most compute-dense part of a floating-point program is usually something that can be assimilated to a matrix-matrix multiplication (for which the BLAS libraries exist, OpenBLAS being one of the better ones), or a matrix factorization (for that there's LAPACK, which uses BLAS, or Eigen). On Haswell, make sure the FMA instructions are being used. – Iwillnotexist Idonotexist Sep 25 '16 at 04:31
  • @IwillnotexistIdonotexist, in my code I'm basically trying to speed up the Caffe deep learning library by converting 32-bit floating point to 16-bit fixed point. Caffe makes a lot of cblas_sgemm calls to perform matrix-matrix products. – Shawn Sep 25 '16 at 11:51
  • Ah, so you're a fellow machine learner! You could have volunteered that up-front (!). Is this a CNN? If so, consider looking at [Winograd convolution](https://arxiv.org/abs/1509.09308) and [Intel's DNN package](https://software.intel.com/en-us/articles/deep-neural-network-technical-preview-for-intel-math-kernel-library-intel-mkl). Other than that, fixed point is an active research area, but Intel processors are ill-suited to take advantage of it. As doug said, the only use case where it truly benefits Intel CPUs is if you are exclusively memory-bottlenecked, and even then only by <2x. Not killer. – Iwillnotexist Idonotexist Sep 25 '16 at 16:49

1 Answer


Summary: Modern FPU hardware is hard to beat with fixed-point, even if you have twice as many elements per vector.

Modern BLAS libraries are typically very well tuned for cache performance (with cache blocking / loop tiling) as well as for instruction throughput. That makes them very hard to beat. SGEMM/DGEMM especially has lots of room for this kind of optimization, because it does O(N^3) work on O(N^2) data, so it's worth transposing just a cache-sized chunk of one input, and stuff like that.
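
For comparison, just calling the tuned single-precision GEMM directly is a one-liner. A minimal sketch, assuming the standard CBLAS interface that OpenBLAS ships (the row-major layout and the wrapper name are illustrative choices, nothing Caffe-specific):

```c
#include <cblas.h>   // OpenBLAS / CBLAS interface

// C = A * B for 32-bit floats (sgemm, not dgemm), all matrices row-major:
// A is MxK, B is KxN, C is MxN.
void sgemm_example(const float *A, const float *B, float *C, int M, int N, int K)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f, A, K,    // lda = K for a row-major MxK matrix
                      B, N,    // ldb = N for a row-major KxN matrix
                0.0f, C, N);   // ldc = N for a row-major MxN matrix
}
```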

What might help is reducing memory bottlenecks by storing your floats in 16-bit half-float format. There is no hardware support for doing math on them in that format, just a couple of instructions to convert between that format and normal 32-bit-element float vectors while loading/storing: VCVTPH2PS (`__m256 _mm256_cvtph_ps(__m128i)`) and VCVTPS2PH (`__m128i _mm256_cvtps_ph(__m256 m1, const int imm8_rounding_control)`). These two instructions comprise the F16C extension, first supported by AMD Bulldozer and Intel IvyBridge.
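
A minimal sketch of that load/convert/compute/convert/store pattern, assuming AVX + F16C are available; the function and array names here are made up for illustration:

```c
#include <immintrin.h>   // AVX + F16C intrinsics
#include <stdint.h>
#include <stddef.h>

// dst[i] = a[i] + b[i], with a, b, dst stored as IEEE half floats (uint16_t).
// Assumes n is a multiple of 8. Compile with e.g. -mavx -mf16c.
void add_half_arrays(const uint16_t *a, const uint16_t *b, uint16_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(a + i)));
        __m256 vb = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(b + i)));
        __m256 vsum = _mm256_add_ps(va, vb);    // the math itself is plain 32-bit float
        _mm_storeu_si128((__m128i *)(dst + i),
                         _mm256_cvtps_ph(vsum, _MM_FROUND_TO_NEAREST_INT));
    }
}
```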

IDK if any BLAS libraries support that format.


Fixed point:

SSE/AVX has no integer division instructions, so that's one major stumbling block for fixed point. (If you're only dividing by constants, you might not need a real div instruction, though.)
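
For instance, dividing by a power-of-two constant only needs an arithmetic shift. A minimal sketch (note the rounding differs from C's truncating division for negative inputs):

```c
#include <emmintrin.h>   // SSE2

// Divide signed 16-bit fixed-point values by the constant 4:
// a power-of-two divisor is just an arithmetic right shift,
// so no real division instruction is needed for this special case.
static __m128i div_by_4(__m128i a)
{
    return _mm_srai_epi16(a, 2);   // rounds toward -infinity for negative lanes
}
```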

Another big downside of fixed point is the extra cost of shifting to correct the position of the decimal (binary?) point after multiplies. That will eat into any gain you could get from having twice as many elements per vector with 16-bit fixed point.
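
For example, multiplying two values in a hypothetical Q8.8 format (16-bit, 8 fractional bits) produces a Q16.16 result spread across two 16-bit halves, so renormalizing back to Q8.8 costs several extra instructions per vector. A rough sketch that ignores overflow handling:

```c
#include <emmintrin.h>   // SSE2

// Multiply two vectors of signed Q8.8 fixed-point values.
// The full product is Q16.16; PMULLW/PMULHW give the low/high 16 bits,
// and we still need shifts and an OR just to pick out the middle 16 bits.
static __m128i mul_q8_8(__m128i a, __m128i b)
{
    __m128i lo = _mm_mullo_epi16(a, b);   // bits  0..15 of each 32-bit product
    __m128i hi = _mm_mulhi_epi16(a, b);   // bits 16..31 of each product (signed)
    return _mm_or_si128(_mm_slli_epi16(hi, 8),   // bits 16..23 -> result bits 8..15
                        _mm_srli_epi16(lo, 8));  // bits  8..15 -> result bits 0..7
}
```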

SSE/AVX actually has quite a good selection of packed 16-bit multiplies (better than for any other element size). There's a packed multiply producing the low half, the high half (signed or unsigned), and even one that takes 16 bits from 2 bits below the top, with rounding (PMULHRSW). Skylake runs those at two per clock, with 5-cycle latency. There are also integer multiply-add instructions, but they do a horizontal add between pairs of multiply results. (See Agner Fog's insn tables, and also the x86 tag wiki for performance links.) Haswell and earlier don't have as many integer-vector add and multiply execution units. Often code bottlenecks on total uop throughput anyway, not on a specific execution port. (But a good BLAS library might even have hand-tuned asm.)
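
For purely fractional data in Q15 format (again just an illustrative choice), that PMULHRSW instruction does the whole rounded multiply in one go, which is about as cheap as 16-bit fixed-point multiplication gets:

```c
#include <tmmintrin.h>   // SSSE3 for _mm_mulhrs_epi16 (PMULHRSW)

// Multiply two vectors of signed Q15 values in [-1, 1):
// each lane computes (a*b + 0x4000) >> 15, i.e. a rounded Q15 product.
static __m128i mul_q15(__m128i a, __m128i b)
{
    return _mm_mulhrs_epi16(a, b);
}
```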

If your inputs and outputs are integer, it's often faster to work with integer vectors, instead of converting to floats. (e.g. see my answer on Scaling byte pixel values (y=ax+b) with SSE2 (as floats)?, where I used 16-bit fixed-point to deal with 8-bit integers).

But if you're really working with floats, and have a lot of multiplying and dividing to do, just use the hardware FPUs. They're amazingly powerful in modern CPUs, and have made fixed-point mostly obsolete for many tasks. As @Iwill points out, FMA instructions are another big boost for FP throughput (and sometimes latency).
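
As a rough illustration of that FP throughput, here is a sketch of an FMA-based dot product (assumes AVX + FMA, i.e. Haswell or later; the function itself is illustrative only, and a real BLAS kernel unrolls further and blocks for the caches):

```c
#include <immintrin.h>   // AVX + FMA intrinsics; compile with e.g. -mavx -mfma
#include <stddef.h>

// Dot product of two float arrays; assumes n is a multiple of 16.
// Each vfmadd does 8 multiplies + 8 adds; two accumulators help hide FMA latency.
float dot_fma(const float *a, const float *b, size_t n)
{
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 16) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),     _mm256_loadu_ps(b + i),     acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8), _mm256_loadu_ps(b + i + 8), acc1);
    }
    __m256 acc = _mm256_add_ps(acc0, acc1);

    // Horizontal sum of the 8 lanes
    __m128 lo = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    lo = _mm_add_ps(lo, _mm_movehl_ps(lo, lo));
    lo = _mm_add_ss(lo, _mm_shuffle_ps(lo, lo, 1));
    return _mm_cvtss_f32(lo);
}
```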


Integer add/subtract/compare instructions (but not multiply) are also lower latency than their FP counterparts.

Peter Cordes
  • Integer add/sub _reductions_ are not just lower-latency but higher throughput! On Haswell you can sustain two `padd*` on ports 1&5 and two vmovaps loads on ports 2&3, per clock cycle. At 16-bit fixed point, that works out to 32 adds/cycle, while float addition, even implemented with FMA `d = a*1.0+c`, can only reach 16 adds/cycle. For bonus points, the `padd*` can be made saturating at no penalty. – Iwillnotexist Idonotexist Sep 25 '16 at 05:32
  • @IwillnotexistIdonotexist: yeah, things are great for add/sub, you get the full benefit of twice as many elements per vector. – Peter Cordes Sep 25 '16 at 05:39