
I used NumPy to compare the execution times of vectorized arithmetic operations on integer arrays of different widths. I created 8-bit, 16-bit, 32-bit and 64-bit integer arrays with 100 million random elements each, and then multiplied each array by the number 7. What is the reason (or reasons, if there is more than one) that computations on smaller-width integer arrays are faster than on larger-width ones? Also, why are computations on 8-bit integer arrays about 4 times faster than on 16-bit arrays, while 16-bit arrays are only about 2 times faster than 32-bit arrays, and 32-bit arrays only about 2 times faster than 64-bit arrays?

Here is the code I tried:

import numpy as np

# Four arrays holding the same 1e8 random values in [0, 10), stored at different widths
np.random.seed(200)
arr_int8 = np.array(np.random.randint(10, size=int(1e8)), dtype=np.int8)

np.random.seed(200)
arr_int16 = np.array(np.random.randint(10, size=int(1e8)), dtype=np.int16)

np.random.seed(200)
arr_int32 = np.array(np.random.randint(10, size=int(1e8)), dtype=np.int32)

np.random.seed(200)
arr_int64 = np.array(np.random.randint(10, size=int(1e8)), dtype=np.int64)

%%timeit
arr_int8_mult = arr_int8 * 7
# 28.5 ms ± 4.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
arr_int16_mult = arr_int16 * 7
# 124 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
arr_int32_mult = arr_int32 * 7
# 250 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
arr_int64_mult = arr_int64 * 7
# 533 ms ± 29.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

On one run (but not consistently), I also got the following message when benchmarking arr_int8_mult: "The slowest run took 5.97 times longer than the fastest. This could mean that an intermediate result is being cached."

I'm not fully sure why there is a speedup going from 32-bit to 16-bit, and an even bigger speedup going from 16-bit to 8-bit. My initial guess was that more smaller-width integers can be packed into a fixed-width register than larger-width ones. But that doesn't explain why 8-bit integers are special and get double the naively expected 2x boost. Another possibility is that results are being cached, but I'm not sure how that would actually happen in practice (if it is even true). The speedups are consistently about 4x, 2x and 2x, so there must be a straightforward explanation for them.

Specifications:

Processor: Intel Core i5-5250U (x86-64, Broadwell) with 3 MiB of L3 cache.

RAM: 8 GB

NumPy version: 1.24.3

Python version: 3.11.4

Operating System: macOS Monterey 12.6.7

Avantgarde
  • An extra Fun Fact about AVX2 (which you're probably implicitly using through numpy) is that it doesn't have a nice way to multiply bytes. Based on that, it should if anything have been *less* than twice as fast, but well.. – harold Aug 20 '23 at 10:35
  • ... but well, this benchmark indicates that we're purely limited by memory bandwidth, not computation, as usual for such low computational intensity (the amount of work done each time data is pulled into registers, or into L1d cache). Note that you're keeping the element count constant, not the size in bytes. SIMD operations process 32 bytes per instruction, regardless of how many elements that is, although for some sizes it'll have to shift and subtract since multiply instructions aren't available for all SIMD element sizes on x86 (`x*7 = (x<<3) - x`). Other than that, SIMD is ~constant bandwidth. – Peter Cordes Aug 20 '23 at 15:31
  • The 4x speed difference between int8 and int16 doesn't make sense to me, either. Is this consistent with other sizes, e.g. smaller ones like one MiB so input/output can stay hot in L3 cache? Or even 100 KiB (less than half L2 size)? Your array size of 95.36 MiB (for int8) is a lot bigger than the L3 cache on most desktop CPUs, so probably it's not helping. Unless you have an AMD Zen with 3D cache for 96 MiB of L3? Are you using an x86 CPU at all, rather than Apple M1 or something? What CPU model number? – Peter Cordes Aug 20 '23 at 15:37
  • I edited the question to add CPU information. – Avantgarde Aug 20 '23 at 16:01
  • On my machine, I get 62 ms, 74 ms, 98 ms and 160 ms using Numpy 1.24.3 on an i5-9600KF CPU (and 40 GiB/s RAM). The version of Numpy matters a lot since we have optimized this quite recently. The OS also matters because of *page-faults* and *huge-pages*. To remove the overhead of page faults, you can use `np.multiply(inArray, 7, out=outArray)` and preallocate `outArray` (with the matching output type); see the sketch after these comments. With that, I get: 50 ms, 51 ms, 52 ms, 74 ms. I guess only the last one is memory-bound, not the previous ones, and I do not have the SIMD-optimized version (my Numpy is too old). Can you try that? – Jérôme Richard Aug 20 '23 at 16:59
  • Besides, the last 3 runs seem too slow to be just memory-bound unless you have low-end RAM. They should *not* be compute-bound given your CPU. For example, the last run reaches 2.8 GiB/s (and 4.2 GiB/s if we count write allocates). This is far lower than what your CPU can reach: 25 GiB/s, even with 1 memory channel, unless it also has a very low latency... I think the computations are bound by the OS: either the page-fault kernel thread on Windows, or you are running out of RAM and the OS does swapping / data compression. This would explain why the first run is much faster. – Jérôme Richard Aug 20 '23 at 17:07
  • @JérômeRichard: I'd hoped `timeit`'s re-runs would take care of warming up the destination array so it's not page-faulting freshly allocated memory every time, or I'd have mentioned page faults (and linked [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) for more details). But I guess that's not the case; well spotted that the bandwidth math is below what even a Broadwell laptop should do, so probably page-faults. Can numpy operate in-place to further reduce memory bandwidth requirements and avoid having to allocate + pre-fault an output buffer? – Peter Cordes Aug 20 '23 at 18:03
  • @PeterCordes AFAIK, if the memory is given back to the OS (once freed), then there will be page faults and the OS will zero the memory again. Whether memory is given back to the OS or not depends on the behaviour of the Python allocator and the standard library, which tends to give memory back to the OS (at least on Windows, IDK about MacOS). Numpy indeed supports in-place operations: one just needs to replace `outArray` by `inArray` in the code provided above. This mainly removes write allocates, so it should be ~1.5x faster. On my machine, results in this case are: 29 ms, 32 ms, 38 ms, 53 ms. – Jérôme Richard Aug 20 '23 at 18:31
  • @PeterCordes Note that the allocated array is pretty large, so it makes sense for the memory to be given back to the OS: for 64-bit integers, the input array takes 0.8 GiB and the output array takes the same space, that is 1.6 GiB of RAM (not to mention the other allocated arrays take another 0.8 GiB, so at least 2.4 GiB should be allocated by the Python process). This is a significant fraction of the 8 GiB of RAM. – Jérôme Richard Aug 20 '23 at 18:38
  • @JérômeRichard If I put the multiplication result in a preallocated array, the times I get are: 21 ms, 42 ms, 77 ms and 155 ms. Roughly 2x each time. If the multiplication is performed in-place, then the times I get are: 12.5 ms, 25 ms, 50 ms and 102 ms. Roughly 2x again. There's no 4x anymore when going from int8 to int16. How can we explain this? – Avantgarde Aug 23 '23 at 22:01
  • I guess this shows that the overhead was from the OS, more precisely page faults. The timings are consistent and indicate that you are likely memory-bound now. The throughput is 14.9 GiB/s. This is not the maximum your processor can do (25.6 GiB/s), but that might be because the operation is sequential and one core is not always enough to fully saturate the bandwidth, or because your RAM does not have a very high frequency so it is already (close to being) saturated, or simply because writes prevent the theoretical throughput from being reached. – Jérôme Richard Aug 23 '23 at 22:19
  • @JérômeRichard Sorry, how did you get 14.9 GiB/s again? – Avantgarde Aug 23 '23 at 22:29
  • Using: `2*4*1e8/0.05/1024**3`. The 2x is for the reads+writes and the 4x is for the int32 data type. By the way, it also shows that Numpy should be using SIMD instructions. Also, note that your Mac most likely has LPDDR3/DDR4 operating at 1600 MHz (2 channels). Such RAM should reach a bandwidth of 23.8 GiB/s, and it is very common to get no more than about 85% of that in such a use case, so roughly 20.3 GiB/s. So I expect the code to be >70% memory-bound, which is good. – Jérôme Richard Aug 23 '23 at 22:39
  • Thanks for the detailed explanations! – Avantgarde Aug 23 '23 at 22:44
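
As suggested in the comments above, here is a minimal sketch of the two re-benchmarking variants: writing into a preallocated output array via np.multiply's out= argument, and multiplying in place. The name out_int32 is just an illustrative choice, and timings will of course differ by machine.

import numpy as np

np.random.seed(200)
arr_int32 = np.array(np.random.randint(10, size=int(1e8)), dtype=np.int32)

# Preallocate the result buffer once, outside the timed region
out_int32 = np.empty_like(arr_int32)

%%timeit
np.multiply(arr_int32, 7, out=out_int32)  # no fresh allocation, so no page faults in the timed loop

%%timeit
np.multiply(arr_int32, 7, out=arr_int32)  # in-place: the input buffer doubles as the output

# Effective bandwidth, as computed in the comments: (reads + writes) * itemsize * n / time,
# e.g. 2 * 4 * 1e8 / 0.05 / 1024**3 ≈ 14.9 GiB/s for the ~50 ms in-place int32 run reported above.

Note that the in-place version keeps multiplying arr_int32 by 7 on every timed run (with int32 wrap-around), which is harmless for a bandwidth measurement but leaves the array scrambled afterwards.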

1 Answer


Two reasons narrower data types make for faster computations:

  1. Less data. An array of 64-bit integers takes four times as much RAM as an array of 16-bit integers. Even with elaborate processor caching, it takes time and power to store and retrieve data, and more bits of data mean more time.

  2. SIMD parallelization: single-instruction, multiple-data operations let modern processors do many simple operations at once, like "multiply a mess of numbers by seven" or "take the inner product of two long vectors of numbers." The processor can handle fewer elements per instruction for wider data types. And 8-bit data is a special, highly optimized case because of its importance in display operations.

numpy does a great job of exploiting all this parallelism.
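
A quick way to see both effects side by side, assuming 256-bit (AVX2-width) vectors as on the Broadwell CPU in the question:

import numpy as np

# Per-element size, SIMD lane count, and total memory footprint for each dtype
for dt in (np.int8, np.int16, np.int32, np.int64):
    itemsize = np.dtype(dt).itemsize
    print(f"{np.dtype(dt).name}: {itemsize} byte(s)/element, "
          f"{32 // itemsize} elements per 256-bit vector, "
          f"{int(1e8) * itemsize / 2**20:.0f} MiB for 1e8 elements")

# int8: 1 byte(s)/element, 32 elements per 256-bit vector, 95 MiB for 1e8 elements
# int16: 2 byte(s)/element, 16 elements per 256-bit vector, 191 MiB for 1e8 elements
# int32: 4 byte(s)/element, 8 elements per 256-bit vector, 381 MiB for 1e8 elements
# int64: 8 byte(s)/element, 4 elements per 256-bit vector, 763 MiB for 1e8 elements

Each halving of the element width doubles the number of elements per vector and halves the bytes that have to move through the memory hierarchy, which is consistent with the roughly 2x steps seen once the page-fault overhead is removed (see the comments above).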

O. Jones
  • Right, but think about how you would multiply *bytes* by 7 with AVX2 (or SSE2 if you prefer); it's not that simple. For 16-bit integers it's simple. – harold Aug 20 '23 at 10:48
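
For reference, the shift-and-subtract rewrite of the multiply mentioned in the comments (7*x = 8*x - x) is easy to check in NumPy; with fixed-width integers both sides wrap around identically, so the identity holds even where the product overflows:

import numpy as np

x = np.arange(-128, 128, dtype=np.int8)      # every possible int8 value
assert np.array_equal(x * 7, (x << 3) - x)   # 7*x == 8*x - x, element-wise (mod 256)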