I am trying to optimize some code for speed, and it is spending a lot of its time in memcpy. I wrote a simple test program to measure memcpy on its own and see how fast my memory transfers are, and they seem very slow to me. I am wondering what might cause this. Here is my test code:
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <stdlib.h>
#define MEMBYTES 1000000000
int main() {
    clock_t begin, end;
    double time_spent[2];
    int i;

    // Allocate 1 GB each for the source and destination buffers
    float *src = malloc(MEMBYTES);
    float *dst = malloc(MEMBYTES);

    // Fill the src array with some numbers (250000000 floats = MEMBYTES bytes)
    begin = clock();
    for (i = 0; i < 250000000; i++)
        src[i] = (float) i;
    end = clock();
    time_spent[0] = (double)(end - begin) / CLOCKS_PER_SEC;

    // Do the memcpy
    begin = clock();
    memcpy(dst, src, MEMBYTES);
    end = clock();
    time_spent[1] = (double)(end - begin) / CLOCKS_PER_SEC;

    // Print results
    printf("Time spent in fill: %1.10f\n", time_spent[0]);
    printf("Time spent in memcpy: %1.10f\n", time_spent[1]);
    printf("dst[400]: %f\n", dst[400]);
    printf("dst[200000000]: %f\n", dst[200000000]);

    // Free memory
    free(src);
    free(dst);

    return 0;
}
/*
gcc -O3 -o mct memcpy_test.c
*/
When I run this, I get the following output:
Time spent in fill: 0.4263950000
Time spent in memcpy: 0.6350150000
dst[400]: 400.000000
dst[200000000]: 200000000.000000
I think the theoretical memory bandwidth for modern machines is tens of GB/s, perhaps over 100 GB/s. I know that in practice one cannot expect to hit the theoretical limit, and that large transfers tend to be slower, but I have seen people report measured speeds of ~20 GB/s for large transfers (e.g. here). My results work out to 3.14 GB/s (edit: I originally had 1.57, but stark pointed out in a comment that I need to count both the read and the write). Does anyone have ideas about why the performance I am seeing is so low, or how to improve it?
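For reference, this is the arithmetic behind that number (the factor of 2 counts both reading src and writing dst):

bandwidth = 2 * MEMBYTES / time_spent[1]
          = 2 * 1e9 bytes / 0.6350150 s
          ≈ 3.1 GB/s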
My machine has two CPUs with 12 physical cores each (Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz). There is 192GB of RAM (I believe it is 12x16GB DDR4-2666). The OS is Ubuntu 16.04.6 LTS.
My compiler is: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Update
Thanks to all the valuable feedback, I am now using a threaded implementation and getting much better performance. Thank you!
I had tried threading before posting and thought the results were poor, but as pointed out below I should have been measuring wall time rather than CPU time. My results with 24 threads are now as follows:
Time spent in fill: 0.4229530000
Time spent in memcpy (clock): 1.2897100000
Time spent in memcpy (gettimeofday): 0.0589750000
I am also using asmlib's A_memcpy with a large SetMemcpyCacheLimit value.
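In case it helps anyone, here is a minimal sketch of the kind of threaded copy I ended up with. It uses plain pthreads and the standard memcpy rather than asmlib's A_memcpy, and the thread count and chunking are just for illustration:

#include <pthread.h>
#include <string.h>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

#define MEMBYTES 1000000000
#define NTHREADS 24

typedef struct {
    char *dst;
    const char *src;
    size_t bytes;
} chunk_t;

static void *copy_chunk(void *arg) {
    chunk_t *c = arg;
    memcpy(c->dst, c->src, c->bytes);   /* each thread copies its own slice */
    return NULL;
}

int main(void) {
    char *src = malloc(MEMBYTES);
    char *dst = malloc(MEMBYTES);
    if (!src || !dst) return 1;
    memset(src, 1, MEMBYTES);           /* touch the source pages before timing */

    pthread_t tid[NTHREADS];
    chunk_t chunk[NTHREADS];
    size_t per = MEMBYTES / NTHREADS;

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);            /* wall time, not CPU time */
    for (int i = 0; i < NTHREADS; i++) {
        chunk[i].dst = dst + (size_t)i * per;
        chunk[i].src = src + (size_t)i * per;
        chunk[i].bytes = (i == NTHREADS - 1) ? MEMBYTES - (size_t)i * per : per;
        pthread_create(&tid[i], NULL, copy_chunk, &chunk[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    gettimeofday(&t1, NULL);

    double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("Threaded memcpy wall time: %f s (%.2f GB/s)\n",
           wall, 2.0 * MEMBYTES / wall / 1e9);

    free(src);
    free(dst);
    return 0;
}
/*
gcc -O3 -pthread -o mct_threads memcpy_threads.c
*/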