15

Does rewriting memcpy/memcmp/... with SIMD instructions make sense in large-scale software?

If so, why doesn't GCC generate SIMD instructions for these library functions by default?

Also, are there any other functions that could be improved by SIMD?

phuclv
limi
  • It depends on what OS and compiler libraries you are using. E.g. Mac OS X already has SIMD-optimised memcpy *et al*. Also Intel's ICC generates inline memcpys which are faster than anything you are likely to be able to implement in a library. – Paul R Mar 16 '11 at 06:41
  • @Paul: `memcpy` is actually the worst case for an SSE intrinsic, because SSE can't be used for the edge cases. Do those compilers emit SIMD code for `strlen` and `memchr`? – Ben Voigt Mar 16 '11 at 13:56
  • @Ben: I just checked with ICC 12 - memcpy and strlen both emit inline SSE code, strchr is a library function which appears to just be straight scalar code. – Paul R Mar 16 '11 at 15:12

5 Answers

8

Yes, these functions are much faster with SSE instructions. It would be nice if your runtime library / compiler intrinsics included optimized versions, but that doesn't seem to be pervasive.

I have a custom SIMD memchr which is a hell of a lot faster than the library version, especially when I'm looking for the first of 2 or 3 characters (for example, to know whether there's an equation in a line of text, I search for the first of `=`, `\n`, `\r`).
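
A minimal sketch of that idea with SSE2 intrinsics (the function name, the fixed three-needle interface and the GCC-style `__builtin_ctz` are just illustrative; a production version also needs alignment handling):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Illustrative sketch: return a pointer to the first occurrence of a, b or c
 * in p[0..len), or NULL.  Compares 16 bytes at a time and ORs the three
 * match masks together, so one movemask/branch covers all three needles. */
static const char *find_first_of3(const char *p, size_t len,
                                  char a, char b, char c)
{
    const __m128i va = _mm_set1_epi8(a);
    const __m128i vb = _mm_set1_epi8(b);
    const __m128i vc = _mm_set1_epi8(c);
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(p + i));
        __m128i hit   = _mm_or_si128(
                            _mm_or_si128(_mm_cmpeq_epi8(chunk, va),
                                         _mm_cmpeq_epi8(chunk, vb)),
                            _mm_cmpeq_epi8(chunk, vc));
        int mask = _mm_movemask_epi8(hit);
        if (mask)
            return p + i + __builtin_ctz(mask);  /* lowest set bit = first match */
    }
    for (; i < len; i++)                         /* scalar tail */
        if (p[i] == a || p[i] == b || p[i] == c)
            return p + i;
    return NULL;
}
```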

On the other hand, the library functions are well tested, so it's only worth writing your own if you call them a lot and a profiler shows they're a significant fraction of your CPU time.

Ben Voigt
  • A SIMD memcpy will normally only be faster for copies where source and/or dest are already in cache, since almost any half decent memcpy should be able to saturate the available DRAM bandwidth. – Paul R Mar 16 '11 at 06:38
  • 2
    @Paul: SIMD is better *always*. If it's not strictly faster because memory access can't keep up, that core is freed up for hyperthreading, power saving, or speculative out-of-order execution. As Crashworks said, SSE will also fetch data into cache faster, because of prefetch hinting. Without SSE, the CPU may have to alternate between fetching data and doing the copy, with SSE both occur in parallel. – Ben Voigt Mar 16 '11 at 13:37
  • in the case of memcpy *et al* there isn't anything else going on in the execution thread, so no benefit there. If your core is stalled waiting for a DRAM access there's not much you can do - DRAM latency can be of the order of 200 clocks, which is a lot of instruction cycles with nothing to do. – Paul R Mar 16 '11 at 13:41
  • 2
    @Paul: (1) Not all `memcpy` calls are for thousands of bytes. You may easily have a `memcpy` call for ~20 bytes inside a loop with other processing. (2) Modern CPU cores aren't limited to processing instructions from a single thread, hence my mention of hyperthreading. (3) DRAM latency is less important when read prefetches are pipelined, only throughput is. (4) Even if DRAM throughput is hobbling the code, it's still better to perform the copy efficiently because the CPU can do the work in the same time and less power consumption (for example, dynamically lowered clock frequency) – Ben Voigt Mar 16 '11 at 13:55
  • What craptastic library are you using that doesn't have a good SIMD `memchr`? Glibc's has hand-written asm versions of `memchr` / `strchr` / `memmove` and so on for i386 and x86-64 (and most other ISAs) that are excellent for large buffers, and many have good small-buffer strategies, too. (With runtime dispatching via dynamic linker symbol resolution so it can use AVX2 on compatible CPUs even in binaries compiled without `-mavx2`). The main thing you could gain is if you know your buffer is aligned and/or at least 16 bytes long so you can avoid branching to pick a strategy. – Peter Cordes Feb 11 '20 at 15:24
  • e.g. https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memchr-avx2.S.html is glibc's memchr with `vpcmpeqb` of 4 vectors, then vpor them all together to save on `vpmovmskb` + `test` uops, with a loop branch once per 2 cache lines. – Peter Cordes Feb 11 '20 at 15:25
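
    (To illustrate the multi-vector trick described in that comment, here is a simplified AVX2 sketch — not glibc's actual code. The hypothetical `memchr128` assumes `len` is a multiple of 128 and skips the alignment/tail handling the real routine does.)

```c
#include <immintrin.h>   /* AVX2 intrinsics */
#include <stddef.h>

/* Compare four 32-byte vectors against the needle and OR the results, so
 * only one vpmovmskb + test/branch is paid per 128 bytes (two cache lines). */
static const char *memchr128(const char *p, char c, size_t len)
{
    const __m256i needle = _mm256_set1_epi8(c);
    for (size_t i = 0; i < len; i += 128) {
        __m256i m0 = _mm256_cmpeq_epi8(_mm256_loadu_si256((const __m256i *)(p + i)),      needle);
        __m256i m1 = _mm256_cmpeq_epi8(_mm256_loadu_si256((const __m256i *)(p + i + 32)), needle);
        __m256i m2 = _mm256_cmpeq_epi8(_mm256_loadu_si256((const __m256i *)(p + i + 64)), needle);
        __m256i m3 = _mm256_cmpeq_epi8(_mm256_loadu_si256((const __m256i *)(p + i + 96)), needle);
        __m256i any = _mm256_or_si256(_mm256_or_si256(m0, m1), _mm256_or_si256(m2, m3));
        if (_mm256_movemask_epi8(any)) {
            /* A match is somewhere in these 128 bytes; re-scan to locate it. */
            for (size_t j = i; ; j++)
                if (p[j] == c)
                    return p + j;
        }
    }
    return NULL;
}
```
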
5

It does not make sense. Your compiler ought to be emitting these instructions implicitly for memcpy/memcmp/similar intrinsics, if it is able to emit SIMD at all.

You may need to explicitly instruct GCC to emit SSE opcodes with e.g. `-msse -msse2`; some GCC targets do not enable them by default. Also, if you do not tell GCC to optimize (i.e. `-O2`), it won't even try to emit fast code.
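
For instance (a small illustration, assuming GCC targeting x86-64, where SSE2 is baseline), a small fixed-size copy is typically inlined as SSE loads/stores once optimization is enabled:

```c
#include <string.h>

/* With `gcc -O2` on x86-64 this typically compiles to a couple of unaligned
 * 16-byte SSE moves (movdqu/movups) instead of a call to memcpy; at -O0 you
 * generally get a real library call. */
void copy32(void *dst, const void *src)
{
    memcpy(dst, src, 32);
}
```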

The use of SIMD opcodes for memory work like this can have a massive performance impact, because they also include cache prefetches and other memory-access hints that are important for optimizing bus usage. But that doesn't mean that you need to emit them manually; even though most compilers stink at emitting SIMD ops generally, every one I've used at least handles them for the basic CRT memory functions.
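
A rough sketch of the kind of hints meant here, using SSE2 prefetch and non-temporal (streaming) stores (illustrative only: it assumes 16-byte-aligned pointers and a size that is a multiple of 64, which real library code does not):

```c
#include <emmintrin.h>   /* SSE2 */
#include <stddef.h>

static void stream_copy(void *dst, const void *src, size_t n)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;
    for (size_t i = 0; i < n; i += 64) {
        _mm_prefetch(s + i + 256, _MM_HINT_NTA);       /* pull source in ahead of use */
        __m128i a = _mm_load_si128((const __m128i *)(s + i));
        __m128i b = _mm_load_si128((const __m128i *)(s + i + 16));
        __m128i c = _mm_load_si128((const __m128i *)(s + i + 32));
        __m128i e = _mm_load_si128((const __m128i *)(s + i + 48));
        _mm_stream_si128((__m128i *)(d + i),      a);  /* non-temporal: bypass the cache */
        _mm_stream_si128((__m128i *)(d + i + 16), b);
        _mm_stream_si128((__m128i *)(d + i + 32), c);
        _mm_stream_si128((__m128i *)(d + i + 48), e);
    }
    _mm_sfence();  /* order the streaming stores before later stores */
}
```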

Basic math functions can also benefit a lot from setting the compiler to SSE mode. You can easily get an 8x speedup on basic sqrt() just by telling the compiler to use the SSE opcode instead of the terrible old x87 FPU.
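
For example (a tiny illustrative function, assuming 32-bit x86 GCC, where x87 is still the default; on x86-64, SSE math is already the default):

```c
#include <math.h>

/* Compiled with `gcc -O2 -msse2 -mfpmath=sse`, the sqrt() here uses the SSE
 * sqrtsd instruction (with a fallback library call only for negative inputs,
 * to set errno) instead of going through the x87 FPU. */
double hypot2(double x, double y)
{
    return sqrt(x * x + y * y);
}
```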

Crashworks
  • Agreed that `memcpy` is the most likely to be properly optimized. A lot of other string and math library functions also benefit immensely and aren't widely optimized by the compiler. – Ben Voigt Mar 16 '11 at 05:40
  • @BenVoigt: GCC doesn't always inline good versions of library functions, but good libraries have good hand-written asm. e.g. [Why is this code 6.5x slower with optimizations enabled?](//stackoverflow.com/q/55563598) shows a case where GCC inlines a very bad `repne scasb` `strlen` at `-O1`, or a complex 32-bit-at-a-time bithack at `-O2` which doesn't take any advantage of SSE2. The program depends entirely on `strlen` performance for huge buffers so it's a big win for it to call glibc's optimized version. There's a big difference between library and inline. – Peter Cordes Feb 11 '20 at 15:29
0

It probably doesn't matter. The CPU is much faster than the memory bus, and the implementations of memcpy etc. provided by the compiler's runtime library are probably good enough. In "large scale" software your performance is not going to be dominated by copying memory, anyway (it's probably dominated by I/O).

To get a real step up in memory-copying performance, some systems have a specialised DMA engine that can copy from memory to memory. If a substantial performance increase is needed, hardware is the way to get it.

Greg Hewgill
  • That largely depends on whether you're using a horribly slow I/O API like C++ iostreams. It's hard to perform any non-trivial processing at the speed the OS can deliver I/O. Besides, SIMD is faster for a variety of reasons, especially on smaller blocks where the setup of a DMA engine would be prohibitively expensive. For one thing, SSE uses a different set of CPU registers, so your working variables stay enregistered and don't get spilled to cache. – Ben Voigt Mar 16 '11 at 05:37
0

I recommend looking at the DPDK memcpy implementation (`rte_memcpy`), which uses SIMD instructions to achieve a high-throughput memcpy:

https://git.dpdk.org/dpdk/tree/lib/eal/x86/include/rte_memcpy.h
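
A minimal usage sketch (assuming a DPDK build environment; `rte_memcpy` has the same calling convention as `memcpy` and the header selects SSE/AVX/AVX-512 code paths depending on how DPDK was configured):

```c
#include <stddef.h>
#include <rte_memcpy.h>

/* Drop-in replacement for memcpy(dst, src, n); the header dispatches to
 * vectorized copy loops for the various size classes. */
void copy_burst(void *dst, const void *src, size_t len)
{
    rte_memcpy(dst, src, len);
}
```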

Intel claims 22% better performance for the SIMD rte_memcpy in Open vSwitch (OvS-DPDK) than for ordinary glibc memcpy.

From Intel's webpage: "Performance comparison between DPDK rte_memcpy and glibc memcpy in OvS-DPDK"

Jalal Mostafa
-1

On x86 hardware it should not matter much, thanks to out-of-order execution. The processor will extract the necessary ILP and try to issue the maximum number of load/store operations per cycle for memcpy, whether it uses the SIMD or the scalar instruction set.

Pari Rajaram