
I use memcpy() to write data to a device. With a logic analyzer/PCIe analyzer, I can see the actual stores.

My device gets more stores than expected.

For example,

auto *data = new uint8_t[1024]();
for (int i=0; i<50; i++){
  memcpy((void *)(addr), data, i);
}

For i=9, I see these stores:

  • 4B from byte 0 to 3
  • 4B from byte 4 to 7
  • 3B from byte 5 to 7
    • 1B-aligned only, re-writing the same data -> inefficient and useless store
  • 1B at byte 8

In the end, all 9 bytes are written, but memcpy issues an extra 3B store that only rewrites data it has already written.

Is this the expected behavior? The question applies to both C and C++. I'm interested in knowing why this happens, as it seems very inefficient.

Mooing Duck
Alexis
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/242311/discussion-on-question-by-alexis-is-it-normal-memcpy-overwrites-data-it-just-wro). – Samuel Liew Feb 23 '22 at 11:32
  • `new` operator is not a C operator, please, take the C tag off, or use a syntax compatible with C language specs. – Luis Colorado Feb 28 '22 at 16:19

2 Answers


The following illustrates why memcpy may be implemented this way.

To copy 9 bytes, starting at a 4-byte aligned address, memcpy issues these instructions (described as pseudo code):

  • Load four bytes from source+0 and store four bytes to destination+0.
  • Load four bytes from source+4 and store four bytes to destination+4.
  • Load four bytes from source+5 and store four bytes to destination+5.

The processor implements the store instructions with these data transfers in hardware:

  • Since destination+0 is aligned, store 4 bytes to destination+0.
  • Since destination+4 is aligned, store 4 bytes to destination+4.
  • Since destination+5 is not aligned, store 3 bytes to destination+5 and store 1 byte to destination+8.
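The split described above can be modelled with a few lines of code. This is only a sketch under an assumption that is not stated in the thread: that the bus breaks every store at 4-byte boundaries. The function name `split_store` is hypothetical, and real hardware may use a different granularity.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model: split a store of `size` bytes at `offset` into pieces
// that never cross a 4-byte boundary. Each pair is (offset, length).
// The 4-byte granularity is an assumption made for illustration.
std::vector<std::pair<std::size_t, std::size_t>>
split_store(std::size_t offset, std::size_t size) {
    std::vector<std::pair<std::size_t, std::size_t>> parts;
    while (size > 0) {
        // Bytes left before the next 4-byte boundary.
        std::size_t room = 4 - offset % 4;
        std::size_t len = size < room ? size : room;
        parts.emplace_back(offset, len);
        offset += len;
        size -= len;
    }
    return parts;
}
```

Under this model, a 4-byte store at offset 0 or 4 stays whole, while a 4-byte store at offset 5 becomes 3 bytes at offset 5 followed by 1 byte at offset 8, which matches the stores reported in the question.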

This is an easy and efficient way to write memcpy:

  • If length is less than four bytes, jump to separate code for that.
  • Loop copying four bytes until fewer than four bytes are left.
  • If length is not a multiple of four, copy four bytes from source+length−4 to destination+length−4.

That single step to copy the last few bytes may be more efficient than branching to three separate cases for the residues 1, 2, and 3.
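The three steps above might look roughly like this in code. It is a minimal sketch, not the code of any real library: `sketch_memcpy` and `copy4` are hypothetical names, and the fixed-size `std::memcpy` calls inside `copy4` are only the portable way to express one unaligned 4-byte load and store (compilers typically lower them to single instructions, but that depends on the toolchain). Like memcpy itself, the sketch assumes that source and destination do not overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy exactly four bytes via one 32-bit temporary.
static inline void copy4(unsigned char *dst, const unsigned char *src) {
    std::uint32_t w;
    std::memcpy(&w, src, sizeof w);  // 4-byte load, any alignment
    std::memcpy(dst, &w, sizeof w);  // 4-byte store, any alignment
}

// Hypothetical sketch of a memcpy that finishes with an overlapping tail copy.
void *sketch_memcpy(void *dst, const void *src, std::size_t n) {
    assert(n == 0 || (dst != nullptr && src != nullptr));
    auto *d = static_cast<unsigned char *>(dst);
    const auto *s = static_cast<const unsigned char *>(src);

    // Lengths below four bytes take a separate path.
    if (n < 4) {
        for (std::size_t i = 0; i < n; ++i) d[i] = s[i];
        return dst;
    }

    // Copy four bytes at a time while at least four bytes remain.
    for (std::size_t i = 0; i + 4 <= n; i += 4) copy4(d + i, s + i);

    // One extra 4-byte copy covers the tail; it re-copies up to three bytes
    // that the loop already handled.
    if (n % 4 != 0) copy4(d + n - 4, s + n - 4);
    return dst;
}
```

For n = 9 this performs 4-byte copies at offsets 0, 4, and 5, so bytes 5 to 7 are written twice, which is the redundant store seen on the analyzer.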

Eric Postpischil
  • Why would memcpy load 4B from src+5 to dst+5? 5 isn't aligned. Why not 1B to dst+8? Everything I have read says memcpy reads and writes byte after byte, not dword after dword (libgcc/memcpy.c). What is your source? – Alexis Feb 23 '22 at 08:49
  • @Alexis: Re “Why would memcpy load 4B from src+5 to dst+5?”: The answer explains that: “That single step to copy the last few bytes may be more efficient than branching to three different cases with various cases.” If the code in `memcpy` is `… for (size_t x = 0; x+4 <= length; x += 4) Copy4Bytes(destination+x, source+x); if (length % 4) Copy4Bytes(destination+length-4, source+length-4);`, that may be better than replacing the last Copy4Bytes with a `switch` statement with multiple `case` labels to handle the residues 1, 2, and 3 separately. The tests and branches may take more time than the copy. – Eric Postpischil Feb 23 '22 at 12:35
  • But the memcpy's code isn't that, it's byte after byte. It's up to the compiler to generate all the actual stores. – Alexis Feb 23 '22 at 12:37
  • @Alexis: Compile your source code with `-S` and show us the generated assembly. If the compiler did not optimize `memcpy` to inline code, then step through it in a debugger and see what instructions it executes. – Eric Postpischil Feb 23 '22 at 12:37
  • @Alexis: There is no specification of `memcpy` that says it must be implemented via byte-after-byte instructions. – Eric Postpischil Feb 23 '22 at 12:38

Is it the expected behavior?

The expected behavior is that it can do anything it feels like (including writing past the end, especially in a "read 8 bytes into a register, modify the first byte in the register, then write 8 bytes" way) as long as the result works as if the rules for the C abstract machine were followed.

Using a logic analyzer/PCIe analyzer to see the actual stores is so far beyond the scope of "works as if the rules for the abstract machine were followed" that it's unreasonable to have any expectations.

Specifically, you can't assume the writes will happen in any specific order, can't assume anything about the size of any individual write, can't assume writes won't overlap, can't assume there won't be writes past the end of the area, can't assume writes will actually occur at all (without volatile), and can't even assume that CHAR_BIT isn't larger than 8 (or that memcpy(dest, source, 10); isn't asking to write 20 octets/"8 bit bytes").

If you need guarantees about writes, then you need to enforce those guarantees yourself (e.g. maybe create a structure of volatile fields to force the compiler to ensure writes happen in a specific order, maybe use inline assembly with explicit fences/barriers, etc).
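As one possible way to enforce this yourself, a hand-written copy loop over volatile pointers issues each store exactly once. This is a sketch under several assumptions that do not come from the thread: the function name `mmio_copy` is hypothetical, the destination is 4-byte aligned, and the device accepts both 32-bit and 8-bit writes. Note that volatile only constrains the compiler; whether the bus sees the same transactions also depends on how the memory is mapped (for example, a write-combining mapping may still merge stores).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy n bytes to device memory with no overlapping stores.
// Assumes dst is 4-byte aligned and the device accepts 32-bit and 8-bit writes.
void mmio_copy(volatile void *dst, const void *src, std::size_t n) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 4 == 0);
    auto *d32 = static_cast<volatile std::uint32_t *>(dst);
    auto *d8 = static_cast<volatile unsigned char *>(dst);
    const auto *s = static_cast<const unsigned char *>(src);

    // Whole 32-bit words first; each volatile store is emitted as written.
    const std::size_t words = n / 4;
    for (std::size_t i = 0; i < words; ++i) {
        std::uint32_t w;
        std::memcpy(&w, s + 4 * i, sizeof w);  // source may be unaligned
        d32[i] = w;
    }

    // Remaining 0 to 3 bytes, one byte store each, never rewriting earlier bytes.
    for (std::size_t i = words * 4; i < n; ++i) d8[i] = s[i];
}
```

For n = 9 this issues two 4-byte stores and one 1-byte store. The trade-off is that it gives up whatever wider or vectorized copies the library memcpy would use.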

Brendan
  • Well, software isn't magic, `memcpy(dest, source, 10);` should write 10Bytes and shouldn't go below `dest/source` and beyond `dest+10/source+10`. If that's really the case, wtf. Any source for that "pray memcpy's god to do what you ask"? Not interested in number of writes nor the order, but I do care when it issues useless stores. – Alexis Feb 23 '22 at 09:13
  • @Alexis: C defines "byte" as "char", which is however many bits `CHAR_BIT` happens to say, which is at least 8 but maybe more. In practice, there are some CPUs (e.g. Analog Devices' SHARC DSP) where `CHAR_BIT == 32`, or where `memcpy(dest, source, 10);` writes 10 x 32 bits. Most people assume "byte" is a synonym for "octet" (a group of 8 bits) and would say "10 x 32 bits = 40 octets = 40 bytes". For a source for "it does what it feels like" see: https://stackoverflow.com/questions/15718262/what-exactly-is-the-as-if-rule – Brendan Feb 23 '22 at 10:44
  • That's completely different than `can't even assume that CHAR_BIT isn't larger than 8`. Common sense would assume the arch to be x86/64 unless otherwise specified. For x86, we can't assume, it **is actually guaranteed** a byte is 8b. Why do you add complexity where it's unnecessary?! Actually I'm using the wrong function to copy data, I need to build my own. – Alexis Feb 23 '22 at 11:49
  • @Alexis: The width of a byte is indeed not relevant to this question. However, for general knowledge, the width of a byte in a C implementation is determined by the C implementation, not by the hardware it runs on. Most C implementations are designed to be efficient on their target hardware and hence choose to use a byte matching the hardware design. However, C implementations can also be designed to support old software or for other special purposes, and they may choose other widths for a byte. Such choices conform to the C standard as long as the width is at least eight bits. – Eric Postpischil Feb 23 '22 at 13:12