I used NumPy to test the differences in execution times of vectorized arithmetic operations on integer arrays of different integer widths. I created 8-bit, 16-bit, 32-bit and 64-bit integer arrays with 100 million random elements each, and then multiplied each array by the number 7. What is the reason (or reasons, if there is more than one) that computations on smaller-width integer arrays are faster than on larger-width ones? And why are computations on 8-bit integer arrays about 4 times faster than those on 16-bit arrays, while 16-bit arrays are only about 2 times faster than 32-bit arrays, and 32-bit arrays only about 2 times faster than 64-bit arrays?
Here is the code I tried:
import numpy as np
np.random.seed(200)
arr_int8 = np.array(np.random.randint(10, size=int(1e8)), dtype=np.int8)
np.random.seed(200)
arr_int16 = np.array(np.random.randint(10, size=int(1e8)), dtype=np.int16)
np.random.seed(200)
arr_int32 = np.array(np.random.randint(10, size=int(1e8)), dtype=np.int32)
np.random.seed(200)
arr_int64 = np.array(np.random.randint(10, size=int(1e8)), dtype=np.int64)
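# Note: each array has 1e8 elements, so the buffers are roughly 100 MB (int8),
# 200 MB (int16), 400 MB (int32) and 800 MB (int64), all much larger than
# this CPU's 3 MiB L3 cache.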
%%timeit
arr_int8_mult = arr_int8*7
# 28.5 ms ± 4.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
arr_int16_mult = arr_int16*7
# 124 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
arr_int32_mult = arr_int32*7
# 250 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
arr_int64_mult = arr_int64*7
# 533 ms ± 29.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
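To check whether these runs are simply memory-bandwidth bound rather than compute bound, I figured I could compare the effective throughput of each multiplication, i.e. bytes read plus bytes written divided by the measured time. This is only a rough sanity check I put together from the mean %%timeit times above, not a rigorous measurement:

# Rough effective-throughput estimate (read + write) per dtype, using the
# mean %%timeit times reported above. Only a sanity check for the
# memory-bandwidth idea, not a proper benchmark.
sizes_bytes = {'int8': arr_int8.nbytes, 'int16': arr_int16.nbytes,
               'int32': arr_int32.nbytes, 'int64': arr_int64.nbytes}
times_s = {'int8': 0.0285, 'int16': 0.124, 'int32': 0.250, 'int64': 0.533}
for dtype, nbytes in sizes_bytes.items():
    # the input array is read once and the result array is written once
    traffic_gb = 2 * nbytes / 1e9
    print(f"{dtype}: {nbytes/1e6:.0f} MB per array, "
          f"~{traffic_gb / times_s[dtype]:.1f} GB/s effective throughput")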
On one run (but not every time), I also got the following message when benchmarking arr_int8_mult: "The slowest run took 5.97 times longer than the fastest. This could mean that an intermediate result is being cached."
I'm not fully sure why there is a speedup in going from 32-bit to 16-bit, and an even bigger speedup in going from 16-bit to 8-bit. My initial guess was that more smaller-width integers can be packed into a fixed-width register than larger-width ones. But that doesn't explain why 8-bit integers are special and get double the naively expected 2x boost. Another possibility is that the results are being cached, but I'm not sure how that would actually happen in practice (if it is even true). The speedups are consistently about 4x, 2x and 2x, so is there a straightforward explanation for them?
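To make the register-packing guess concrete: on an AVX2 machine like this Broadwell CPU, the vector registers are 256 bits wide, so (assuming NumPy's multiply loop is SIMD-vectorized at all, which I haven't verified) the number of elements handled per register would be:

# How many lanes of each integer width fit into a 256-bit AVX2 register.
# This only illustrates my packing guess; I haven't checked which SIMD
# instructions NumPy's integer-multiply loop actually uses on this CPU.
register_bits = 256  # AVX2 vector width on Broadwell
for bits in (8, 16, 32, 64):
    print(f"int{bits}: {register_bits // bits} elements per 256-bit register")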
Specifications:
Processor: Intel Core i5-5250U (x86-64, Broadwell) with 3 MiB of L3 cache
RAM: 8 GB
NumPy version: 1.24.3
Python version: 3.11.4
Operating System: macOS Monterey 12.6.7