
How can one benchmark memcpy? I wrote test code, but it finishes immediately (probably due to compiler optimization) and does not actually allocate memory:

void test(void)
{
 const uint32_t size = 4000'000'000;
 char a[size], b[size];
 printf("start\n");
 for(int i=0; i<10'000'000; i++)
     memcpy(b, a, size*sizeof(char));
 printf("end\n");
}// end of function

I want to know the cost of memcpy in terms of CPU time and in terms of wall time.

Here is the situation: I need to process incoming network data at a high rate. If I do not process it fast enough, the network buffers overfill and I am disconnected from the data source (which happens quite frequently in my test code). I can see that the CPU usage of my process is quite low (10-15%), so there must be some operation that costs wall time without costing CPU time. I therefore want to estimate the contribution of memcpy operations to the wall time it takes to process one unit of data. The code is basically some computation plus memory-copy operations; there is no resource I have to wait on that could slow me down.

Thank you for your help!

[EDIT:]

Thank you very much for your comments! And sorry that my example was C++ only rather than plain C - my priority was readability. Here is a new example, which shows that memcpy is not free and consumes 100% of CPU time:

const uint32_t N = 1000'000'000;
char *a = new char[N], 
     *b = new char[N];
void test(void)
{
 for(uint32_t i=0; i<N; i++)
     a[i] = '7';

 printf("start\n");
 for(int i=0; i<100; i++)
     memcpy(b, a, N*sizeof(char));
 printf("end\n");
}// end of function

which leaves me confused about why I have low CPU usage yet still cannot process the incoming data quickly enough.

S.V
  • Micro benchmarking is hard. Fortunately google benchmark makes it easier to do. You can play with an online version at [quick-bench](http://quick-bench.com/) – NathanOliver Nov 08 '19 at 21:12
  • Also note that `sizeof(char)` is **always** `1` so it is not needed. – NathanOliver Nov 08 '19 at 21:13
  • Are quotes in integer literals allowed in C? I think they are not and I would remove them from the code in that case (to make it C and C++). – walnut Nov 08 '19 at 21:15
  • If you want to find the source of a performance problem in an actual running program, you should use a profiler to find it, not benchmark individual parts of it. A profiler gives you a more or less exact time spent at each part of your code. – walnut Nov 08 '19 at 21:19
  • @uneven_mark: That's correct, `40'000` is C++ only, not C. https://godbolt.org/z/eFjjwG shows that, and also that the `memcpy` optimizes away out of this loop when compiled with optimization enabled. You could use global arrays + memory barriers like `asm("":::"memory")` to make it happen, or maybe GCC `-fno-builtin-memcpy`. Making the arrays global would also avoid overflowing the stack from huge arrays in automatic storage. – Peter Cordes Nov 08 '19 at 21:28
  • The reason your code "finishes immediately" is because it appears to crash in a fiery explosion. The shown code appears to attempt to allocate slightly less than 8 gigabytes worth of memory for two arrays. In automatic storage. This is not going to end well. – Sam Varshavchik Nov 08 '19 at 21:28
  • @SamVarshavchik: It optimizes away because they're both unused other than memcpy, unless you disable optimization. – Peter Cordes Nov 08 '19 at 21:28
  • Splendid. And if it doesn't optimize away, it'll still blow up. Either way, this isn't going anywhere. – Sam Varshavchik Nov 08 '19 at 21:29
  • @SamVarshavchik: Yes, exactly. In my comment posted seconds before yours, I made the same point about using global arrays to avoid stack overflow when you do prevent them from optimizing away. My point was that optimizing away is a likely explanation for finishing right away, and doesn't require the OP to have missed an error message about their program faulting. – Peter Cordes Nov 08 '19 at 21:29
  • 4
    @S.V: why do you want to time 4GiB memcpy when your real problem is network packets? glibc memcpy uses a different strategy (NT stores) for very huge copies. And the Linux kernel's `read` / `recv` paths end up using `copy_to_user`, I assume, which uses a different memory-copy function: hopefully `rep movsb` on x86 CPUs with the ERMSB feature. [Enhanced REP MOVSB for memcpy](//stackoverflow.com/q/43343231) goes over a bunch of x86 memory / cache performance details. – Peter Cordes Nov 08 '19 at 21:32
  • 1
    Regarding your edit: If you just wanted to see whether `memcpy` of a large array will take 100% CPU user time, then everyone could have told you so beforehand. As @PeterCordes commented: Where in packet networking do you have `memcpy`'s of gigabytes of data? The test does not seem to reflect anything in your actual code. Again I suggest looking at the output of a profiler instead... – walnut Nov 08 '19 at 22:10
  • Thank you, everybody, for your comments! They are all very useful. My edit meant to be a first attempt to reconcile low CPU usage and not being able to process data quickly enough. And so, the idea was to test if memory copy is done by directly copying data in RAM with small participation of CPU (which is more likely to see if RAM chunks are large, and so the process is not dominated by CPU time). I am trying to be systematic and exclude all explanations no matter how improbable they might appear. – S.V Nov 08 '19 at 22:26

1 Answer


the idea was to test if memory copy is done by directly copying data in RAM with small participation of CPU (which is more likely to see if RAM chunks are large, and so the process is not dominated by CPU time).

No, memcpy on normal computers doesn't offload to a DMA engine / blitter chip and let the CPU do other things until that completes. The CPU itself does the copying, so as far as the OS is concerned memcpy is no different from any other instructions user-space could be running.

A C++ implementation on an embedded system or an Atari Mega ST could plausibly do that, letting the OS schedule another task or at least do some housekeeping - though only with very lightweight context switching, because even a huge block of memory doesn't take very long to copy.


An easier way to find that out would be to single-step into the memcpy library function. (And yes, with your update gcc doesn't optimize away the memcpy.)

Other than that, testing a 4GiB memcpy isn't very representative of network packets. glibc memcpy on x86 uses a different strategy (NT stores) for very huge copies. And for example the Linux kernel's read / recv paths end up using copy_to_user, I assume, which uses a different memory-copy function: hopefully rep movsb on x86 CPUs with the ERMSB feature.

See Enhanced REP MOVSB for memcpy for a bunch of x86 memory / cache performance details.

Peter Cordes