Does there exist a way to do small fixed range multiplication efficiently?

Question

Suppose I have an inline function:

inline int mul(short x, short y) {
    return (int)x * (int)y;
}

Here y is in {1,2,...,32}, and x is in {-4,-3,-2,-1,0,1,...,8192}. Considering y is in a very small range, does there exist a way to speed up mul()?

Background: this code is extracted from a scientific computing program written in C/C++, and profiling has shown that the above function consumes over 10% CPU time of the whole program since it is called very frequently. Therefore, I would like to try to figure out a way to speed it up.

Thank you :)

It is likely that the multiplication is going to get compiled down to a single CPU instruction. As such, no further optimization is likely to happen. — Sam Varshavchik, Dec 06 '15 at 03:04
Maybe it's possible to reduce the number of calls to this instead of reducing the cost of the call itself? — Kevin, Dec 06 '15 at 03:20
Looking at the generated assembly would help, and you can make sure that all calls to the function truly gets inlined, `inline` doesn't guarantee this. — nnn, Dec 06 '15 at 03:28
Changing to: `inline int mul(int x, int y) { return x * y;}` could also help. — nnn, Dec 06 '15 at 03:34
I think we're looking at this too much with a microscope. If we zoom out a bit, you might find opportunities to vectorize the arithmetic, improve locality of reference, etc. Just staring at a simple function which is almost certainly going to be inlined to multiply two ints is going to make it really difficult to optimize -- too little info. — , Dec 06 '15 at 03:42
10% is probably not a bottleneck. Also, arithmetic instructions are a bottleneck only in very well tuned applications. Virtually all modern apps are bound by memory accesses. You should really *look towards memory* first. As of arithmetics, you might *vectorize*, *parallelize* or to *improve algorithms* to take less multiplications (i.e. to call `mul` less frequently). — Ivan Aksamentov - Drop, Dec 06 '15 at 05:56
If profiling shows this is taking 10% of the CPU time, clearly it isn't being inlined at all. Hard to understand why this is a function at all. I would just remove it, or turn it into a macro. Nothing wrong with writing `x*y` in the code. — user207421, Dec 06 '15 at 06:09

Don Reba · Accepted Answer · 2015-12-06T05:40:08.927

Intel's SSE4 intrinsics provide the data type __m128i, which can hold 4 32-bit values.

__m128i _mm_mullo_epi32(__m128i a, __m128i b)

Packed integer 32-bit multiplication with truncation of upper halves of results.

Reference

You can perform 4 multiplications at a time. Since you know that your data range is limited, truncation won't be a problem. You could also use single-precision floating point and the older mulps intrinsic.

Besides, it might be a good idea to analyze your program with a profiler like VTune and see if you are suffering from excessive cache misses, aliasing, or alignment problems.

Does there exist a way to do small fixed range multiplication efficiently?

1 Answers1