34

Under what circumstances should I expect memcpy calls to outperform assignments on modern Intel/AMD hardware? I am using GCC 4.2.x on a 32-bit Intel platform (but am interested in 64-bit as well).

Setjmp
  • Interesting question! Since you are obviously concerned with improving the speed of memory operations: I recently read about the role of compression in memory transfer, written by someone developing pyTables: http://www.pytables.org/docs/StarvingCPUs.pdf As described there, ordinary use of memcpy can be slow compared to his improvements with very fast compressors ([blosc](http://blosc.pytables.org/trac/)). Consider this for high-performance work only! – math Mar 20 '12 at 19:48
  • This question is quite broad. – D. A. Sep 09 '14 at 19:32

1 Answer

44

You should never expect them to outperform assignments. The reason is that the compiler will use memcpy anyway when it thinks that would be faster (provided you enable optimization flags). Otherwise, if the structure is small enough to fit into registers, the compiler can manipulate registers directly, which requires no memory access at all.

GCC has special block-move patterns internally that decide when to modify registers or memory cells directly and when to call the memcpy function. Note that when a struct is assigned, the compiler knows at compile time how big the move will be, so it can, for instance, unroll small copies (emit n moves in a row instead of looping). Note the -mno-memcpy option (documented among GCC's MIPS-specific options):

-mmemcpy
-mno-memcpy
    Force (do not force) the use of "memcpy()" for non-trivial block moves.  
    The default is -mno-memcpy, which allows GCC to inline most constant-sized copies.

Who knows better when to use memcpy than the compiler itself?
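A minimal sketch for checking this yourself, assuming a hypothetical three-int struct `point` and two illustrative helper names (`copy_assign`, `copy_memcpy`) that are not from the question. Both functions copy the same fixed-size object, one by assignment and one by memcpy with a compile-time-constant size:

```c
#include <string.h>

/* Hypothetical small struct, used only for illustration. */
struct point {
    int x;
    int y;
    int z;
};

/* Copy by plain structure assignment; the size is known at compile time. */
void copy_assign(struct point *dst, const struct point *src)
{
    *dst = *src;
}

/* Copy with memcpy and a constant size; an optimizing compiler
 * is free to replace this call with inline moves. */
void copy_memcpy(struct point *dst, const struct point *src)
{
    memcpy(dst, src, sizeof *dst);
}
```

Compiling this with something like `gcc -O2 -S` and comparing the assembly emitted for the two functions shows what your particular compiler version and target actually do; the result may differ between versions and platforms, so treat it as an experiment rather than a guarantee.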

Marco Bonelli
Johannes Schaub - litb
  • 4
    Note that the reverse can apply - in GCC at least, memcpy of a small constant size is replaced with copy instructions, and if used with a pointer to a small source and/or destination does *not* prevent one or both being optimised into registers. So: do whatever results in the simplest code. – Steve Jessop Nov 27 '08 at 16:08
  • 4
    You shouldn't expect one to outperform the other. If you have a performance problem, you should profile it, see if assignment/memcpy is the problem, and if so, try changing them to use the other, and see if that performs better. More profiling, less guesswork. ;) – jalf Nov 27 '08 at 16:10
  • 1
    That is to say, I would expect "assignments will outperform memcpy" also to be false, given that the questioner has specified a recent GCC. But assuming no cast is required, I agree with your advice to use assignment, since it results in the clearest code. – Steve Jessop Nov 27 '08 at 16:11
  • @jalf: I totally agree. Since the question was "which is faster?", not "should I care which is faster?", I think "the compiler will deal with it whichever you do" is a fair answer, even though in the big picture the true answer is probably "why are you even asking?" ;-) – Steve Jessop Nov 27 '08 at 16:13
  • @Steve I see what you meant now. I changed "assignments will outperform memcpy" to a more conservative statement now. – Johannes Schaub - litb Aug 19 '11 at 22:31
  • 5
    Never say never... We had done some work on an embedded processor which uses a software unaligned exception handler. We found that structure assignment (using pointers) often caused unaligned exceptions, whereas memcpy did not. The cost of the exceptions was very high, so in the case where the memory was not necessarily aligned, memcpy was MUCH faster than assignment. – John Oct 11 '13 at 18:35
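The unaligned case from the last comment can be sketched as follows, assuming a hypothetical `record` struct and a hypothetical helper `load_record`. Casting a possibly misaligned byte pointer to a struct pointer and dereferencing it is undefined behavior in C, whereas memcpy places no alignment requirement on its arguments:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical record layout, used only for illustration. */
struct record {
    uint32_t id;
    uint32_t value;
};

/* Read a record from a byte buffer at an arbitrary (possibly
 * unaligned) address. memcpy copies byte-wise as far as the
 * language is concerned, so no aligned access is assumed. */
struct record load_record(const unsigned char *bytes)
{
    struct record r;
    memcpy(&r, bytes, sizeof r);
    return r;
}
```

On targets that tolerate unaligned loads the compiler may still turn this into ordinary moves; on targets that do not, it has to emit code that is safe for any address.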