Implementing own memcpy (size in bytes?)

Question

I recently had an interview question where I had to implement memcpy. I've used memcpy plenty in my experience so it didn't seem like a tough problem.

So, I started implementing a loop to copy one address at a time from pointer to pointer, something like this:

void memcpy(void* dest, void* src, int size){
    for(int index = 0; index < size; index++){
        dest[index] = src[index];
    }
}

However the interviewers interrupted noting that the man page for memcpy says it "copies n bytes from src to dest" (which I confirmed later) and then wanted me to iterate instead by size/4 and then pick up the remaining with another loop of index < size%4 (I guess assuming it was a 32 bit system?)

Well, this seemed strange seeing as how I've used memcpy for years with no problem without having to give it a *4 modifier). When I got home I fired up gdb and copied a small string "hello" and was careful to input the size both with strlen() and constants to see where it starts and stops.

    char* src = "hello";
    char* dest = calloc(16, sizeof(char));
    int len = strlen(src);
    memcpy(dest, src, len); // both my version and official version

Now I carefully examined src and dest with gdb which both contained "hello\0".

So my question is: what am I not understanding about using the number 4, (or "size in bytes")? And why does the documentation say "n bytes" when that's not really the behavior? What am I not seeing clearly here?

The prototype should be `void memcpy (void* dest, const void* src, size_t size);`. It should return a void* if it needs to be compatible with C standard memcpy(). — Lundin, Aug 09 '12 at 06:45
@Lundin: Small correction, I think you made a typo here, prototype should be "void* memcpy (void* dest, const void* src, size_t size);" — Rndp13, Jun 26 '18 at 08:55
@Rndp13 Read the whole comment. To be picky, the standard C prototype is `void* memcpy (void*restrict s1, const void*restrict s2, size_t n);` — Lundin, Jun 26 '18 at 09:58

score 15 · Answer 1 · answered Aug 09 '12 at 04:00

15

As others have said, copying 4 bytes at a time is faster than copying 1 byte at a time. The interviewer wanted you to do something like this:

void memcpy(void* dest, void* src, int size)
{
    uint8_t *pdest = (uint8_t*) dest;
    uint8_t *psrc = (uint8_t*) src;

    int loops = (size / sizeof(uint32_t));
    for(int index = 0; index < loops; ++index)
    {
        *((uint32_t*)pdest) = *((uint32_t*)psrc);
        pdest += sizeof(uint32_t);
        psrc += sizeof(uint32_t);
    }

    loops = (size % sizeof(uint32_t));
    for (int index = 0; index < loops; ++index)
    {
        *pdest = *psrc;
        ++pdest;
        ++psrc;
    }
}

answered Aug 09 '12 at 04:00

Remy Lebeau

555,201
31
458
770

Thank you for the demonstration. I understand now. I have come to terms with the fact that I misunderstood how memory is laid out. I thought that each individual address was capable of storing an entire word, not just a byte (for instance address 0x12345678 could hold a value of 0xffffffff and then 0x12345679 could hold a completely different value 0x00000000). Now I realize that 0x12345678 holds only one byte of a word and the next word will start at 0x12345678 + 4. This can easily be seen in gdb with the x command. – VaporwareWolf Aug 09 '12 at 04:05
@Rey Lebeau May I ask the reason behind copying 4 bytes is faster than copying one byte? assuming 32-bit system. I guess it saves the time of aligning each byte within each word in the one byte situation? Also, when copying the remainder bytes, doesn't it need to cast the pointer back to char* to copy one byte at a time? *pdest = *psrc; ++pdest; ++psrc; – Cong Hui Oct 23 '13 at 23:25
Never mind I second question, didn't see the void* pointer was casted to char* already. – Cong Hui Oct 23 '13 at 23:34
`uint8_t` and `uint32_t` are defined in `#include ` . [source](https://stackoverflow.com/a/8953288/5407848) – Accountant م Apr 20 '19 at 19:08
1

@Accountantم in C, yes. In C++, use `` instead – Remy Lebeau Apr 20 '19 at 21:42
This does not handle several special cases correctly: (1) `dest` and `src` are unaligned (by the same amount). Byte-by-byte copying would be needed to get them aligned before jumping into the `uint32_t` loop. (2) `dest` and `src` are unaligned by different amounts. This can't be fixed. The whole thing must be done byte-by-byte. – John Kugelman May 03 '21 at 13:11

score 13 · Accepted Answer · answered Aug 09 '12 at 03:31

13

They were asking you to optimize your implementation and have it copy a 32-bit word at a time inside the loop vs. a byte at a time. This would necessitate some careful checking to handle the boundary cases, such as size not being a multiple of 4, or dest or src not being aligned on a 4-byte boundary.

answered Aug 09 '12 at 03:31

John Kugelman

349,597
67
533
578

But "dest[index] = src[index];" doesn't just copy one byte. I realize that void* cannot be deferferences, so suppose src and dest were int*. That operation doesn't copy a byte, it copies all 32 bits of the integer. One a 32bit machine, each address can have 32bits of data, correct? I am not understanding. – VaporwareWolf Aug 09 '12 at 03:37
1

Minor nit-pick: dest and src don't have to be word-aligned. They can be misaligned by the same amount, though of course you'll have to special-case the first few bytes as well as the last few. – Marcelo Cantos Aug 09 '12 at 03:37
@HCHogan: Since you declared the types as `void*`, most people reading your question assumed you meant `char*`, which is the usual choice for a naive memcpy. That's why John suggested you were copying a byte at a time. – Marcelo Cantos Aug 09 '12 at 03:42

score 1 · Answer 3 · answered Aug 09 '12 at 03:34

The logic for your memcpy is correct and your interviewer didn't ask you to change it or add a restriction. Copying 4 bytes at a time is faster, but becomes a problem if your size is not a multiple of 4. Hence your interviewer told you to use two loops: the first copies 4 bytes at a time, and the 2nd loop one byte at a time (it will iterate at most 3 times).

So the bulk of the copy is done with the fast 4-byte copy, but you're not restricted to size being a multiple of 4 because the 2nd "cleanup" loop will copy anything that's not a multiple of 4.

1st loop: copy uint32_t and increment by 4
2nd loop: copy uint8_t and increment by 1

There can be problems even if the total data to copy is a multiple of 4 bytes. The first pointer might point to a byte address of the form 4N+1, and the second to a byte address of the form 4N+2; you can't easily do copying in multiples of 4 bytes when that's the case. — Jonathan Leffler, Aug 09 '12 at 04:04

score 1 · Answer 4 · answered Aug 09 '12 at 03:41

The interviewer was testing your knowledge of computer architecture, and wanted you to optimize your algorithm. Memory operates on words rather than bytes. In a 32-bit system, a word is typically 4 bytes, it takes the same amount of time to read/write 1 byte as it does to read/write 1 word. The second loop is to handle the case where the number of bytes you want to copy is not exactly divisible by 4 bytes.

What you actually want is 3 loops. 2 loops for the bytes after dest and before dest+size that when either is not word aligned. Then another loop for all the words in between.

You can actually optimize even more by taking advantage of architecture specific instructions. Check out this article if you are interested: http://www.eetimes.com/design/embedded/4024961/Optimizing-Memcpy-improves-speed

score 1 · Answer 5 · answered Aug 09 '12 at 06:41

The interviewers asked you to perform premature optimization, for one reason or another. This is usually a bad idea.

It is true that a 32-bit machine will copy one 32-bit chuck faster than it will copy 4x1 bytes. But there is more to optimization than that.

There is a big chance that a 32-bit machine puts your data in cache memory, and then suddenly fast memory access might be far more relevant than CPU instructions. Cache memories might have various alignment requirements. They may prefer a plain loop or they may prefer 32-bit aligned chunks. I'm not an expert on the subject, so I avoid premature optimization and leaves it to the compiler, which hopefully knows more about cache memories than I do.

Then there is CPU branch prediction and instruction piping. This particular code is fairly deterministic, so this might not be an issue. But as a rule of thumb: simple code yields more effective branch prediction than complex code.

Furthermore, there is division, which is slow on many CPU architectures. Depending on the amount of data to copy, the divisions may cause the memcpy to be far slower.

To sum it up: manual optimization is quite complex and requires in-depth knowledge of the CPU and hardware. You cannot and should not "optimize for a 32-bit CPU", you need to know the specifics. In most cases the compiler will optimize the code far more effectively than you will. The library memcpy() in particular, is often written in inline assembler, optimized for the specific target.

Again, people with sufficient experience will know that caches too are aligned, and at even coarser boundaries. 16 bytes is a common value. "Division" is similarly a red herring. Division by a constant power of two is for all intents and purposes free, because it's a bitshift (`>>2`). — MSalters, Aug 09 '12 at 07:47

score 0 · Answer 6 · answered Aug 09 '12 at 03:31

0

They wanted you to speed it up. A 32-bit processor can copy 32 bits faster than it can copy 8 bits. So if someone wants to copy 4 bytes rather than do it one at a time then you can do it all at once.

answered Aug 09 '12 at 03:31

dave

4,812
4
25
38

But "dest[index] = src[index];" copies the entire register, not just a byte. (Suppose I changed my function signature to int* instead of void*). – VaporwareWolf Aug 09 '12 at 03:40
But then your loop is wrong and copies 4 times the number of bytes that you want to. They ignored the type problems (as that's easily fixed, and wasn't what they wanted to find out from you in an interview ....) – dave Aug 09 '12 at 05:10

score 0 · Answer 7 · answered Apr 24 '17 at 06:34

Check this out..

void myMemCpy(void *dest, void *src, size_t n)
{
   // Typecast src and dest addresses to (char *)
   char *csrc = (char *)src;
   char *cdest = (char *)dest;

   // Copy contents of src[] to dest[]
   for (int i=0; i<n; i++)
       cdest[i] = csrc[i];
}

For more info

Implementing own memcpy (size in bytes?)

7 Answers7

Linked