
In my benchmark using numpy 1.12.0, calculating dot products with float32 ndarrays is much faster than with the other data types:

In [1]: import numpy as np
In [3]: f16 = np.random.random((500000, 128)).astype('float16')
In [4]: f32 = np.random.random((500000, 128)).astype('float32')
In [5]: uint = np.random.randint(1, 60000, (500000, 128)).astype('uint16')

In [7]: %timeit np.einsum('ij,ij->i', f16, f16)
1 loop, best of 3: 320 ms per loop

In [8]: %timeit np.einsum('ij,ij->i', f32, f32)
The slowest run took 4.88 times longer than the fastest. This could mean that an intermediate result is being cached.
10 loops, best of 3: 19 ms per loop

In [9]: %timeit np.einsum('ij,ij->i', uint, uint)
10 loops, best of 3: 43.5 ms per loop

I've tried profiling einsum, but it just delegates all the computing to a C function, so I can't tell what the main reason for this performance difference is.
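Here's roughly what I mean (a minimal sketch, run as a script; cProfile charges essentially all the time to a single built-in call):

import cProfile
import numpy as np

f32 = np.random.random((500000, 128)).astype('float32')

# Nearly all the time is attributed to one opaque einsum call, so a
# Python-level profiler can't show where the dtype-dependent cost arises.
cProfile.run("np.einsum('ij,ij->i', f32, f32)")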

satoru
  • It is likely that it has been compiled to work with float32 and float64, and not the smaller types. Modern operating systems and processors are 64-bit. – hpaulj May 22 '17 at 02:49
  • I think it also has to do with the fact that numpy has to emulate `float16`. See [here](http://stackoverflow.com/questions/38975770/python-numpy-float16-datatype-operations-and-float8), and the sketch after these comments. – romeric May 22 '17 at 02:57
  • @romeric But `uint16` also tends to be slower, why's that? – satoru May 22 '17 at 02:57
  • One use case I can think of right now: if I have a dataset that doesn't really need `float32` precision, I can cut the memory consumption by half if `uint16` is as fast. – satoru May 22 '17 at 03:01
  • There could be more than one issue going on here. Note that `einsum` uses explicit SIMD intrinsics under the hood. FPUs are in general much faster, and what you are doing (a double contraction) is an exact application of fused multiply-add (FMA). There is no FMA for integers, and integer arithmetic in general cannot sustain the same throughput. But the issue might just as well be somewhere else. – romeric May 22 '17 at 03:09
  • `f16` is slower than `f32` in other operations such as `dot` and `*`. In fact I had to kill the `f16` dot test. It's only with raw byte operations like `copy` that the smaller storage space of `f16` has an advantage. – hpaulj May 22 '17 at 03:45
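
(Sketch following up the `float16` emulation point in the comments. The assumption, per the linked question, is that x86 has no native float16 arithmetic, so each operation converts to float32 and back; timings are illustrative.)

import timeit
import numpy as np

f16 = np.random.random((500000, 128)).astype('float16')

# Direct float16 computation: each element is widened to float32,
# operated on, and narrowed back, so the compact storage buys nothing.
t_f16 = timeit.timeit(lambda: np.einsum('ij,ij->i', f16, f16), number=10)

# Upcast once, then compute in float32: the conversion cost is paid a
# single time instead of inside every arithmetic operation.
f32 = f16.astype('float32')
t_f32 = timeit.timeit(lambda: np.einsum('ij,ij->i', f32, f32), number=10)

print('float16 einsum:', t_f16)
print('upcast + float32 einsum:', t_f32)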

1 Answer


My tests with your f16 and f32 arrays show that f16 is 5-10x slower for all calculations. It's only when doing byte-level operations like array copy that the more compact nature of float16 shows any speed advantage.
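
A minimal sketch of that comparison (exact numbers will vary by machine and numpy build):

import timeit
import numpy as np

f16 = np.random.random((500000, 128)).astype('float16')
f32 = np.random.random((500000, 128)).astype('float32')

# Arithmetic: float16 loses badly, since the values are widened for
# every operation.
print('f16 multiply:', timeit.timeit(lambda: f16 * f16, number=10))
print('f32 multiply:', timeit.timeit(lambda: f32 * f32, number=10))

# Raw byte movement: float16 wins, since only half as many bytes move.
print('f16 copy:', timeit.timeit(lambda: f16.copy(), number=100))
print('f32 copy:', timeit.timeit(lambda: f32.copy(), number=100))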

The gcc docs have a section on half-precision floats (fp16):

https://gcc.gnu.org/onlinedocs/gcc/Half-Precision.html

With the right processor and the right compiler switches, it may be possible to build numpy in a way that speeds up these calculations. We'd also have to check whether numpy's `.h` files have any provision for special handling of half floats.

Earlier questions that may be good enough to serve as duplicate references:

Python Numpy Data Types Performance

Python numpy float16 datatype operations, and float8?

hpaulj