I am trying to optimize some code for speed, and it is spending a lot of its time in memcpy. I wrote a simple test program to measure memcpy on its own and see how fast my memory transfers are, and they seem very slow to me. I am wondering what might cause this. Here is my test code:
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <stdlib.h>
#define MEMBYTES 1000000000
int main() {
    clock_t begin, end;
    double time_spent[2];
    int i;

    // Allocate 1 GB each for the source and destination buffers
    float *src = malloc(MEMBYTES);
    float *dst = malloc(MEMBYTES);

    // Fill the src array with some numbers (250000000 floats = MEMBYTES bytes)
    begin = clock();
    for (i = 0; i < 250000000; i++)
        src[i] = (float) i;
    end = clock();
    time_spent[0] = (double)(end - begin) / CLOCKS_PER_SEC;

    // Do the memcpy
    begin = clock();
    memcpy(dst, src, MEMBYTES);
    end = clock();
    time_spent[1] = (double)(end - begin) / CLOCKS_PER_SEC;

    // Print results
    printf("Time spent in fill: %1.10f\n", time_spent[0]);
    printf("Time spent in memcpy: %1.10f\n", time_spent[1]);
    printf("dst[400]: %f\n", dst[400]);
    printf("dst[200000000]: %f\n", dst[200000000]);

    // Free memory
    free(src);
    free(dst);

    return 0;
}
/*
gcc -O3 -o mct memcpy_test.c
*/
When I run this, I get the following output:
Time spent in fill: 0.4263950000
Time spent in memcpy: 0.6350150000
dst[400]: 400.000000
dst[200000000]: 200000000.000000
I think the theoretical memory bandwidth for modern machines is tens of GB/s, perhaps over 100 GB/s. I know that in practice one cannot expect to hit the theoretical limit, and that large transfers tend to be slower, but I have seen people report measured speeds of ~20 GB/s for large transfers (e.g. here). My results work out to 3.14 GB/s (edit: I originally had 1.57, but stark pointed out in a comment that I need to count both the read and the write). Does anyone have ideas about why the performance I am seeing is so low, or how to improve it?
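For reference, this is the arithmetic behind that number (the factor of 2 counts both reading src and writing dst):

bandwidth = 2 * MEMBYTES / time_spent[1]
          = 2 * 1e9 bytes / 0.6350150 s
          ≈ 3.1 GB/s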
My machine has two CPUs with 12 physical cores each (Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz). There is 192GB of RAM (I believe it is 12x16GB DDR4-2666). The OS is Ubuntu 16.04.6 LTS.
My compiler is: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Update
Thanks to all the valuable feedback, I am now using a threaded implementation and getting much better performance. Thank you!
I had tried threading before posting and thought the results were poor, but as pointed out below I should have been measuring wall time rather than CPU time. My results with 24 threads are now as follows:
Time spent in fill: 0.4229530000
Time spent in memcpy (clock): 1.2897100000
Time spent in memcpy (gettimeofday): 0.0589750000
I am also using asmlib's A_memcpy with a large SetMemcpyCacheLimit value.
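In case it helps anyone, here is a minimal sketch of the kind of threaded copy I ended up with. It uses plain pthreads and the standard memcpy rather than asmlib's A_memcpy, and the thread count and chunking are just for illustration:

#include <pthread.h>
#include <string.h>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

#define MEMBYTES 1000000000
#define NTHREADS 24

typedef struct {
    char *dst;
    const char *src;
    size_t bytes;
} chunk_t;

static void *copy_chunk(void *arg) {
    chunk_t *c = arg;
    memcpy(c->dst, c->src, c->bytes);   /* each thread copies its own slice */
    return NULL;
}

int main(void) {
    char *src = malloc(MEMBYTES);
    char *dst = malloc(MEMBYTES);
    if (!src || !dst) return 1;
    memset(src, 1, MEMBYTES);           /* touch the source pages before timing */

    pthread_t tid[NTHREADS];
    chunk_t chunk[NTHREADS];
    size_t per = MEMBYTES / NTHREADS;

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);            /* wall time, not CPU time */
    for (int i = 0; i < NTHREADS; i++) {
        chunk[i].dst = dst + (size_t)i * per;
        chunk[i].src = src + (size_t)i * per;
        chunk[i].bytes = (i == NTHREADS - 1) ? MEMBYTES - (size_t)i * per : per;
        pthread_create(&tid[i], NULL, copy_chunk, &chunk[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    gettimeofday(&t1, NULL);

    double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("Threaded memcpy wall time: %f s (%.2f GB/s)\n",
           wall, 2.0 * MEMBYTES / wall / 1e9);

    free(src);
    free(dst);
    return 0;
}
/*
gcc -O3 -pthread -o mct_threads memcpy_threads.c
*/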