
I was reading the article "Why mmap is faster than system calls", where the main difference appeared to be mmap's ability to use vector instructions like AVX2, something system calls can't do.
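
Here is a minimal C sketch of the comparison the article makes (my own simplification, not the article's code): with read() the kernel copies data from the pagecache into a user buffer, while with mmap user-space code touches the pagecache pages directly, so the copy/scan loop runs in user space where the compiler can vectorize it. The file name data.bin is a placeholder and error handling is abbreviated.

```c
/* Sketch: sum the bytes of a file via read() vs. mmap().
   "data.bin" is a placeholder; error handling is abbreviated. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static uint64_t sum_bytes(const unsigned char *p, size_t len) {
    uint64_t s = 0;
    for (size_t i = 0; i < len; i++)   /* simple loop the compiler may auto-vectorize */
        s += p[i];
    return s;
}

int main(void) {
    const char *path = "data.bin";     /* hypothetical input file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);

    /* read(): the kernel copies from the pagecache into buf. */
    unsigned char *buf = malloc(st.st_size);
    ssize_t got = read(fd, buf, st.st_size);
    printf("read() sum = %llu\n",
           (unsigned long long)sum_bytes(buf, got > 0 ? (size_t)got : 0));
    free(buf);

    /* mmap(): the pagecache pages are accessed directly, with no extra copy. */
    unsigned char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    printf("mmap() sum = %llu\n",
           (unsigned long long)sum_bytes(map, st.st_size));
    munmap(map, st.st_size);

    close(fd);
    return 0;
}
```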

I understand that the SIMD instructions used by GPUs tend to be much wider: an NVIDIA warp of 32 threads operating on float32 is 1024 bits (?) vs. the 256 bits of AVX2, so potentially a 4x speedup. I guess this isn't exploited in traditional discrete-GPU settings because the host-to-device (and back) copies would outweigh any benefit from wide registers.
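
Spelling out that width arithmetic (a trivial sanity check of the numbers above, assuming 32 lanes per warp and 32-bit floats):

```c
#include <stdio.h>

int main(void) {
    int warp_bits = 32 * 32;  /* 32 lanes per NVIDIA warp x 32-bit float = 1024 bits */
    int avx2_bits = 256;      /* one AVX2 ymm register */
    printf("%d bits vs %d bits: %dx wider\n",
           warp_bits, avx2_bits, warp_bits / avx2_bits);
    return 0;
}
```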

However, in APUs the GPU shares memory with the CPU, eliminating the need for those expensive copies. I was wondering whether those GPU instructions can therefore be used to further accelerate vector operations like the ones behind mmap-based reads (NumPy is another example). Has it already been done (on M1 Macs, or on any CPU with integrated graphics)? Or can you please detail the architectural issues that prevent this?

Shihab Shahriar Khan
  • If you think this question isn't a good fit for SO (or have voted to close it), please let me know why. I'm open to any suggestions to improve it. I've thought about the ideal venue for this question, and one close-voter suggested Super User, but I respectfully disagree with that judgment. – Shihab Shahriar Khan Jan 10 '21 at 22:54
  • I wouldn't expect the Linux kernel to use the GPU for the *common* `memcpy` function. Using the GPU implies the need to set up a GPU **context** (at least, GPU registers) and to protect that context from concurrent threads. This is the same reason the FPU isn't used in the kernel. It could be, however, that some specific in-kernel functions actually use the GPU. – Tsyvarev Jan 11 '21 at 10:12

1 Answer


You're kind of asking 2 separate questions: whether an OS (or user-space standard libraries?) can use GPGPU to speed up reading from the pagecache (into user-space memory with a read system call, or from an mmapped region), and separately whether GPGPU on normally-allocated process memory (and/or the pagecache) can avoid a copy to memory dedicated to the GPU.

For the 2nd part, Apple has said the answer is yes for macOS on M1, thanks to making the integrated GPU's memory accesses cache-coherent with the CPU. I think AMD made similar suggestions that copying could be avoided in graphics or GPGPU drivers on their APUs (Fusion, IIRC?), but IDK if software ever took full advantage.
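
As a rough illustration of that zero-copy idea (not anything from Apple's or AMD's drivers, just an assumed-typical host-side setup), OpenCL's CL_MEM_USE_HOST_PTR asks the runtime to operate on a host allocation in place. Whether the driver really avoids a copy depends on the hardware: integrated GPUs with unified memory can, while discrete GPUs will generally still copy across PCIe. On Apple Silicon the native route would be Metal buffers with shared storage rather than OpenCL.

```c
/* Sketch: create an OpenCL buffer over ordinary process memory.
   On an APU / integrated GPU with unified memory the driver *may* use the
   allocation in place (zero-copy); on a discrete GPU it typically copies.
   Error handling is abbreviated; the include is <OpenCL/opencl.h> on macOS. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_int err;
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    /* Ordinary process memory, e.g. a buffer read() filled or an mmapped file.
       Page alignment and a cache-line-multiple size improve the odds that the
       driver maps it without copying. */
    size_t bytes = (size_t)1 << 22;              /* 4 MiB */
    float *host_buf = aligned_alloc(4096, bytes);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                bytes, host_buf, &err);
    printf("buffer over host memory: %s\n", err == CL_SUCCESS ? "ok" : "failed");

    /* ... enqueue kernels that read/write buf; the CPU sees the results in
       host_buf after the appropriate map/unmap or finish calls ... */

    clReleaseMemObject(buf);
    clReleaseContext(ctx);
    free(host_buf);
    return 0;
}
```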


For the first part: doubtful. Large memory copies are bottlenecked by DRAM bandwidth, not by CPU-core <-> L1d cache bandwidth (which is what scales with SIMD register width). On x86, an AVX2 loop on a single core can come pretty close to maxing out the DRAM bandwidth of an Intel "client" chip (quad-core or similar, not a big Xeon with a higher-latency interconnect). Single-core bandwidth (to L3 or DRAM) tends to be limited by the number of outstanding cache misses a core can track, not by doing the copy with fewer instructions. Fewer instructions mostly help by letting the same-size out-of-order execution window see farther ahead, e.g. starting page walks sooner across page boundaries and things like that. See Why is std::fill(0) slower than std::fill(1)? for SSE (16-byte) vs. AVX (32-byte) vectors.
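
For concreteness, this is roughly the kind of 256-bit copy loop being discussed (a hand-written sketch, not glibc's actual memcpy, assuming an AVX2-capable x86 CPU). The point above is that for large sizes a loop like this already saturates DRAM bandwidth, so even wider hardware wouldn't make the copy finish sooner:

```c
/* Sketch of a 256-bit (AVX2) copy loop, roughly what an optimized memcpy does
   for medium/large copies (hand-written here, not glibc's implementation).
   Compile with -mavx2. For large sizes this is DRAM-bandwidth bound. */
#include <immintrin.h>
#include <stddef.h>

void copy_avx2(void *dst, const void *src, size_t bytes) {
    char *d = dst;
    const char *s = src;
    size_t i = 0;
    for (; i + 32 <= bytes; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(s + i)); /* 32-byte load */
        _mm256_storeu_si256((__m256i *)(d + i), v);               /* 32-byte store */
    }
    for (; i < bytes; i++)   /* scalar tail for the last <32 bytes */
        d[i] = s[i];
}
```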

GPU offload would thus not help for large copies. It could only possibly help for small copies, and then it would not leave the copy result hot in the CPU's L1d cache, and/or could not take advantage of the source or destination already being hot in the L1d cache of a CPU working with the data.

Also, setup overhead (communicating with the GPU, going outside the current core) would dwarf any gain from faster copying of small buffers.

Peter Cordes