Understanding the source code of memcpy()

Question

void *memcpy(void *dst, const void *src, size_t len)
{
        size_t i;

        /*
         * memcpy does not support overlapping buffers, so always do it
         * forwards. (Don't change this without adjusting memmove.)
         *
         * For speedy copying, optimize the common case where both pointers
         * and the length are word-aligned, and copy word-at-a-time instead
         * of byte-at-a-time. Otherwise, copy by bytes.
         *
         * The alignment logic below should be portable. We rely on
         * the compiler to be reasonably intelligent about optimizing
         * the divides and modulos out. Fortunately, it is.
         */

        if ((uintptr_t)dst % sizeof(long) == 0 &&
            (uintptr_t)src % sizeof(long) == 0 &&
            len % sizeof(long) == 0) {

                long *d = dst;
                const long *s = src;

                for (i=0; i<len/sizeof(long); i++) {
                        d[i] = s[i];
                }
        }
        else {
                char *d = dst;
                const char *s = src;

                for (i=0; i<len; i++) {
                        d[i] = s[i];
                }
        }

        return dst;
}

I was just going through an implementation of memcpy, to understand how it differs from using a loop. But I couldn't see any difference between using a loop and using memcpy, as memcpy itself uses a loop internally to copy.

I couldn't understand the if part, where they copy integers: i < len/sizeof(long). Why is this calculation required?

Where's this code come from? I've seen better optimized memcpy implementations... — Maxime Chéramy, Jul 11 '13 at 11:08
@Maxime: how would you know: you don't even know the target processor (or compiler for that matter)! — Olof Forshell, Jul 11 '13 at 11:20
@Angus, judging by your answer in http://stackoverflow.com/questions/11772553/why-padding-is-not-happening-in-this-case/11773431#11773431 you seem to understand alignment. The `long` is a processor word and the address needs to be processor word aligned for a faster copy (most architectures do faster copies on aligned data). If you can't do it, then do it slowly, byte by byte. There are good answers below. — nonsensickle, Jul 11 '13 at 11:20
@OlofForshell because if the buffers are not aligned or the length is not a multiple of `sizeof(long)`, it does a slow copy. It's easy to imagine copying aligned data whose size is not a multiple of `sizeof(long)` and ending up with a slow copy. But you are right: for very specific target processors and use cases, this could be the better choice. Still, this code aims to be portable, and for the majority of cases it could be improved. — Maxime Chéramy, Jul 11 '13 at 11:35
Doesn't this function break aliasing rules (accessing memory via a long* that wasn't necessarily declared to be long), and isn't it therefore undefined behaviour? — jcoder, Jul 11 '13 at 12:04
@jcoder the implementation's implementation doesn't have to follow any rules. It would be a violation if you copied this code into a function of your own name and used it as such. — M.M, Aug 21 '14 at 19:47

score 19 · Answer 1 · answered Jul 11 '13 at 11:04

I couldn't understand the if part, where they copy integers: i < len/sizeof(long). Why is this calculation required?

Because in this case they are copying words, not individual bytes (as the comment says, it is an optimization: it requires fewer iterations, and the CPU can handle word-aligned data more efficiently).

len is the number of bytes to copy, and sizeof(long) is the size of a single word, so the number of elements to copy (that is, the number of loop iterations to execute) is len / sizeof(long).
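The arithmetic can be checked in isolation. The following is only a sketch: the helper names word_iterations and leftover_bytes are made up for illustration and do not appear in the quoted source.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the quoted memcpy): number of loop
 * iterations needed to copy len bytes when each iteration moves one long. */
size_t word_iterations(size_t len)
{
    return len / sizeof(long);
}

/* Hypothetical helper: bytes that a pure word loop would leave uncopied.
 * The quoted memcpy only takes the word path when this is zero. */
size_t leftover_bytes(size_t len)
{
    return len % sizeof(long);
}
```

For example, where sizeof(long) is 8, a 32-byte copy needs 4 word iterations instead of 32 byte iterations, with nothing left over.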

– Andreas Fester

Word-aligned copy does not actually depend on the CPU. It depends on how the RAM is wired. – m0skit0 Jul 11 '13 at 11:17

@m0skit0 I would say it depends on the architecture - there are CPUs which cannot even access data at non-aligned addresses, and there are ones which can access individual bytes at all addresses, but then these accesses might be much slower than accessing an aligned address – Andreas Fester Jul 11 '13 at 11:21

Yes, I know. And that's precisely because of how the RAM is wired ;) Anyway, this is a detail. Your answer is correct. – m0skit0 Jul 11 '13 at 11:37

Probably should have said "architecture" in my answer instead of "CPU", as you did :) – Andreas Fester Jul 11 '13 at 11:45

score 6 · Answer 2 · answered Jul 11 '13 at 11:06

to understand how it differs from using a loop. But I couldn't see any difference between using a loop and using memcpy, as memcpy itself uses a loop internally to copy

Well then, it uses a loop. Maybe other implementations of libc don't do it like that. Anyway, what's the problem if it does use a loop? Also, as you can see, it does more than loop: it checks for alignment and performs a different kind of loop depending on the alignment.

I couldn't understand the if part, where they copy integers: i < len/sizeof(long). Why is this calculation required?

The if condition checks for memory word alignment. If the destination and source addresses are word-aligned and the length to copy is a multiple of the word size, then it performs an aligned copy by word (long), which is faster than copying by byte (char), not only because each iteration moves more data, but also because most architectures do word-aligned copies much faster.
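The three conditions can be pulled out into small predicates to make the check easier to read. This is a sketch only: is_long_aligned and can_copy_by_words are hypothetical names, and the code assumes, as the quoted source does, that a pointer converted to uintptr_t reflects its address.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero when p lies on a sizeof(long) boundary.
 * This is the same remainder test the quoted memcpy applies to dst and src. */
int is_long_aligned(const void *p)
{
    return (uintptr_t)p % sizeof(long) == 0;
}

/* Hypothetical helper: nonzero when the quoted memcpy would take its
 * word-at-a-time branch, i.e. both pointers are aligned and the length
 * is a whole number of longs. */
int can_copy_by_words(const void *dst, const void *src, size_t len)
{
    return is_long_aligned(dst) &&
           is_long_aligned(src) &&
           len % sizeof(long) == 0;
}
```

On typical platforms an array of long satisfies is_long_aligned, while the same address plus one byte does not.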

It's only worth copying by words (4 bytes here) if these conditions hold. Otherwise a word copy might be slower than a byte copy. In fact, on some architectures you can only copy memory by words if the addresses are word-aligned; otherwise you get a hardware exception (e.g. MIPS, IIRC). — m0skit0, Jul 12 '13 at 09:53

score 6 · Answer 3 · edited Aug 28 '17 at 10:10

len % sizeof(long) == 0 checks that you are copying a whole number of longs, not part of a long.

        if ((uintptr_t)dst % sizeof(long) == 0 &&
            (uintptr_t)src % sizeof(long) == 0 &&
            len % sizeof(long) == 0) {

                long *d = dst;
                const long *s = src;

                for (i=0; i<len/sizeof(long); i++) {
                        d[i] = s[i];
                }

This checks for alignment and, if it holds, copies fast (sizeof(long) bytes at a time).

        else {
                char *d = dst;
                const char *s = src;

                for (i=0; i<len; i++) {
                        d[i] = s[i];
                }
        }

This handles everything else, such as misaligned arrays, with a slow copy (1 byte at a time).
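One comment on the question notes that aligned data whose size is not a multiple of sizeof(long) still ends up in the slow branch. A minimal sketch of how the two loops could be combined to cover that case is given here. It is not the quoted implementation: the name copy_words_then_tail is hypothetical, and whether it is actually faster depends on the target and compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant, NOT the quoted memcpy: when both pointers are
 * long-aligned, copy as many whole longs as fit, then finish the remaining
 * len % sizeof(long) bytes one at a time. Buffers must not overlap.
 * Like the quoted code, it accesses memory through a long *, so it is only
 * strictly conforming when the buffers really hold long objects. */
void *copy_words_then_tail(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;

    if ((uintptr_t)dst % sizeof(long) == 0 &&
        (uintptr_t)src % sizeof(long) == 0) {
        long *dw = dst;
        const long *sw = src;
        size_t words = len / sizeof(long);
        size_t w;

        for (w = 0; w < words; w++) {
            dw[w] = sw[w];
        }
        i = words * sizeof(long);   /* bytes already copied by words */
    }

    for (; i < len; i++) {          /* the tail, or everything if misaligned */
        d[i] = s[i];
    }
    return dst;
}
```

With aligned buffers and a length of 3 * sizeof(long) + 1, this does three word copies and one byte copy, where the quoted code would fall back to copying every byte individually.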

score 4 · Answer 4 · answered Jul 11 '13 at 11:04

for (i=0; i<len/sizeof(long); i++) {
    d[i] = s[i];
}

In this for loop, one long is copied per iteration, and there are len bytes to copy in total; that's why it needs i<len/sizeof(long) as the loop condition.

– Yu Hao

score 0 · Answer 5 · answered Jun 21 '19 at 09:21

I was just going through an implementation of memcpy, to understand how it differs from using a loop. But I couldn't see any difference between using a loop and using memcpy, as memcpy itself uses a loop internally to copy.

A loop (a control statement) is one of the basic elements of the language, alongside if (a decision statement) and a few other such constructs. So the question here is not really about the difference between ordinary looping and using memcpy.

memcpy just helps by giving you a ready-to-use API call, instead of making you write 20 lines of code for a trivial task. If you wish, you can write your own code that provides the same functionality.

The second point, as already mentioned, is the optimization it applies when copying as long rather than as bytes. With long, it copies a block of data at once (what we call a word) instead of copying byte by byte, which would take longer. Where a long is 8 bytes, an operation that would need 8 iterations byte by byte is done by memcpy in a single iteration by copying the whole word at once.

score 0 · Answer 6 · answered Dec 26 '19 at 08:57

If you look at the assembly code of memcpy, you can see that on a 32-bit system each register is 32 bits wide and can hold 4 bytes at a time; if you copy only one byte per 32-bit register, the CPU needs extra instruction cycles.

If len/count is a multiple of 4, we can copy 4 bytes per loop iteration:

    MOV FROM, R2
    MOV TO,   R3
    MOV R2,   R4
    ADD LEN,  R4
CP: MOV (R2+), (R3+) ; "(Rx+)" means "*Rx++" in C
    CMP R2, R4
    BNE CP
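For comparison, the loop above can be rendered in C roughly as follows. This is a sketch under assumptions: copy_like_asm is a hypothetical name, long stands in for the machine word that each MOV transfers, and LEN is taken to be a byte count.

```c
#include <stddef.h>

/* Hypothetical C rendering of the assembly loop. r2, r3 and r4 mirror the
 * registers: source cursor, destination cursor, and end of the source.
 * len_bytes must be a nonzero multiple of sizeof(long): like the assembly,
 * the loop only tests its exit condition after the first copy. */
void copy_like_asm(long *to, const long *from, size_t len_bytes)
{
    const long *r2 = from;                              /* MOV FROM, R2 */
    long *r3 = to;                                      /* MOV TO,   R3 */
    const long *r4 = from + len_bytes / sizeof(long);   /* R4 = R2 + LEN */

    do {
        *r3++ = *r2++;      /* MOV (R2+), (R3+) */
    } while (r2 != r4);     /* CMP R2, R4 ; BNE CP */
}
```

Calling copy_like_asm(to, from, sizeof from) on two equally sized arrays of long copies every element, one word per iteration.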
