3

Suppose I have an inline function:

inline int mul(short x, short y) {
    return (int)x * (int)y;
}

Here y is in {1,2,...,32}, and x is in {-4,-3,-2,-1,0,1,...,8192}. Considering y is in a very small range, does there exist a way to speed up mul()?

Background: this code is extracted from a scientific computing program written in C/C++, and profiling has shown that the above function consumes over 10% CPU time of the whole program since it is called very frequently. Therefore, I would like to try to figure out a way to speed it up.

Thank you :)

ACcreator
  • 1,362
  • 1
  • 12
  • 29
  • 2
    It is likely that the multiplication is going to get compiled down to a single CPU instruction. As such, no further optimization is likely to happen. – Sam Varshavchik Dec 06 '15 at 03:04
  • Maybe it's possible to reduce the number of calls to this instead of reducing the cost of the call itself? – Kevin Dec 06 '15 at 03:20
  • That's right, but that would be another question. – ACcreator Dec 06 '15 at 03:25
  • 1
    Looking at the generated assembly would help, and you can make sure that all calls to the function truly gets inlined, `inline` doesn't guarantee this. – nnn Dec 06 '15 at 03:28
  • 1
    Changing to: `inline int mul(int x, int y) { return x * y;}` could also help. – nnn Dec 06 '15 at 03:34
  • 1
    I think we're looking at this too much with a microscope. If we zoom out a bit, you might find opportunities to vectorize the arithmetic, improve locality of reference, etc. Just staring at a simple function which is almost certainly going to be inlined to multiply two ints is going to make it really difficult to optimize -- too little info. –  Dec 06 '15 at 03:42
  • What processor are you using? – chux - Reinstate Monica Dec 06 '15 at 05:10
  • @chux intel Haswell & Skylake – ACcreator Dec 06 '15 at 05:14
  • 10% is probably not a bottleneck. Also, arithmetic instructions are a bottleneck only in very well tuned applications. Virtually all modern apps are bound by memory accesses. You should really *look towards memory* first. As of arithmetics, you might *vectorize*, *parallelize* or to *improve algorithms* to take less multiplications (i.e. to call `mul` less frequently). – Ivan Aksamentov - Drop Dec 06 '15 at 05:56
  • If profiling shows this is taking 10% of the CPU time, clearly it isn't being inlined at all. Hard to understand why this is a function at all. I would just remove it, or turn it into a macro. Nothing wrong with writing `x*y` in the code. – user207421 Dec 06 '15 at 06:09

1 Answers1

2

Intel's SSE4 intrinsics provide the data type __m128i, which can hold 4 32-bit values.

__m128i _mm_mullo_epi32(__m128i a, __m128i b)

Packed integer 32-bit multiplication with truncation of upper halves of results.

Reference

You can perform 4 multiplications at a time. Since you know that your data range is limited, truncation won't be a problem. You could also use single-precision floating point and the older mulps intrinsic.

Besides, it might be a good idea to analyze your program with a profiler like VTune and see if you are suffering from excessive cache misses, aliasing, or alignment problems.

Don Reba
  • 13,814
  • 3
  • 48
  • 61