
I tested memcpy behavior on my system after reading the question "Why does the speed of memcpy() drop dramatically every 4KB?"

Details of my system:

arun@arun-OptiPlex-9010:~/mem_copy_test$ lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Stepping:              9
CPU MHz:               1600.000
BogoMIPS:              6784.45
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7

arun@arun-OptiPlex-9010:~/mem_copy_test$ cat /proc/cpuinfo | grep 'model name'| head -1

model name  : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

arun@arun-OptiPlex-9010:~/mem_copy_test$ uname -a

Linux arun-OptiPlex-9010 3.13.0-40-generic #69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Test program:

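#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>
#include <string.h>

/* Copy buf_size bytes iters times and print the throughput in MiB/s. */
void memcpy_speed(unsigned long buf_size, unsigned long iters)
{
    struct timeval start, end;
    unsigned char *pbuff_1;
    unsigned char *pbuff_2;
    unsigned long i;
    double elapsed_us;

    /* The buffer contents are never initialized; only the copy is timed. */
    pbuff_1 = malloc(buf_size);
    pbuff_2 = malloc(buf_size);
    if (pbuff_1 == NULL || pbuff_2 == NULL) {
        fprintf(stderr, "malloc of %lu bytes failed\n", buf_size);
        exit(EXIT_FAILURE);
    }

    gettimeofday(&start, NULL);
    for (i = 0; i < iters; ++i) {
        memcpy(pbuff_2, pbuff_1, buf_size);
    }
    gettimeofday(&end, NULL);

    /* Elapsed wall-clock time in microseconds. */
    elapsed_us = (end.tv_sec - start.tv_sec) * 1000.0 * 1000.0
               + (end.tv_usec - start.tv_usec);

    /* Bytes per microsecond is 10^6 bytes/s; dividing by 1.024 * 1.024 gives MiB/s. */
    printf("%5.3f\n", ((double)buf_size * iters / (1.024 * 1.024)) / elapsed_us);

    free(pbuff_1);
    free(pbuff_2);
}

int main(void)
{
    unsigned long buf_size;
    unsigned int i;

    /* Test buffer sizes from 1 KB to 16384 KB in 1 KB steps. */
    for (i = 1; i < 16385; i++) {
        printf("bufsize in kb=%u speed=", i);
        buf_size = i * 1024UL;
        memcpy_speed(buf_size, 10000);
        printf("\n");
    }
    return 0;
}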

I am sharing the output via Google Drive because Stack Overflow does not allow me to post images (it says 10 reputation is needed for that).

Output for 1 to 256 KB: https://drive.google.com/file/d/0B3mnbsS6F4tpY2dhRWJLaEY1RWc/view?usp=sharing

Output for 1 to 16384 KB: https://drive.google.com/file/d/0B3mnbsS6F4tpeC1Dd2R1VnJOV2c/view?usp=sharing

1) Why does the graph have a peak at 11-13 KB?

2) Why does the behavior differ between 20 to 129 KB (range 1) and 130 to 256 KB (range 2)? Range 1 has its maximum speed at sizes that are not multiples of 4 KB, whereas range 2 has its maximum speed at multiples of 4 KB, and with large peaks. Range 2 is also faster than range 1 at multiples of 4 KB.

3) Why does the speed drop dramatically close to 3000 KB?

--Arun

  • Side note: In order to filter out undesired impact of factors that are not related to your test, I suggest that you allocate `pbuff_1` and `pbuff_2` statically (and even better - globally) to the maximum possible size of `16384 * 1024` entries. – barak manos Dec 26 '14 at 08:22
  • Because that's when the "from" and "to" line up in the cache and loading one invalidates the other. – U2EF1 Dec 26 '14 at 08:23
  • @barakmanos That is why I am doing memcpy 10000 times between malloc and free. I feel a large number like 10000 can eliminate any parameters related to the test setup. This follows the comments on http://stackoverflow.com/questions/21038965/why-speed-of-memcpy-drops-dramatically-every-4kb – Arun chandran Dec 26 '14 at 15:10
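
For reference, the static allocation suggested in the first comment could be sketched roughly as below. This is only an illustration, not code from the question: `MAX_BUF_SIZE`, `touch_buffers()` and `memcpy_speed_static()` are hypothetical names, and writing to both buffers once before timing is an added assumption (so that first-touch page faults are not part of the measurement), not something the comment asks for.

```c
#include <stddef.h>
#include <string.h>
#include <sys/time.h>

/* Largest size used by the test loop in the question: 16384 KB. */
#define MAX_BUF_SIZE (16384UL * 1024UL)

/* Global, statically allocated buffers, as the comment suggests. */
unsigned char pbuff_1[MAX_BUF_SIZE];
unsigned char pbuff_2[MAX_BUF_SIZE];

/* Assumption (not in the comment): write to every byte once so the
 * pages are already mapped before any timing starts. */
void touch_buffers(void)
{
    memset(pbuff_1, 0xA5, sizeof pbuff_1);
    memset(pbuff_2, 0x5A, sizeof pbuff_2);
}

/* Hypothetical variant of memcpy_speed(): returns MiB/s instead of
 * printing it, and returns -1.0 if buf_size does not fit the buffers. */
double memcpy_speed_static(size_t buf_size, unsigned long iters)
{
    struct timeval start, end;
    unsigned long i;
    double elapsed_us;

    if (buf_size > MAX_BUF_SIZE) {
        return -1.0;
    }

    gettimeofday(&start, NULL);
    for (i = 0; i < iters; ++i) {
        memcpy(pbuff_2, pbuff_1, buf_size);
    }
    gettimeofday(&end, NULL);

    /* Elapsed wall-clock time in microseconds. */
    elapsed_us = (end.tv_sec - start.tv_sec) * 1000.0 * 1000.0
               + (end.tv_usec - start.tv_usec);

    /* Same unit conversion as the program in the question. */
    return ((double)buf_size * (double)iters / (1.024 * 1.024)) / elapsed_us;
}
```

To use it, call `touch_buffers()` once at startup and then call `memcpy_speed_static(i * 1024UL, 10000)` from the same loop as in the question. With this layout the two buffers keep the same addresses for every size, which removes one variable (where malloc happens to place each buffer) from the comparison.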

1 Answer

memcpy uses different copying algorithms depending on the size (and alignment) of the data it is given. You will also see the effects of the L1 and L2 caches: once the data you want to move overflows a cache, that cache is no longer of assistance to the process.
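
To make the cache point concrete, here is a rough sketch that maps a buffer size to the smallest cache level that could hold both buffers at once, using the cache sizes from the lscpu output in the question. It assumes the working set is simply source plus destination (2 * buf_size) and ignores associativity, prefetching, other data in the caches, and the fact that the L3 is shared between cores, so treat it as a crude model only. The name `smallest_cache_level()` is made up for this illustration.

```c
#include <stddef.h>

/* Cache sizes taken from the lscpu output in the question. */
#define L1D_BYTES (32UL * 1024UL)
#define L2_BYTES  (256UL * 1024UL)
#define L3_BYTES  (8192UL * 1024UL)

/* Returns 1, 2 or 3 for the smallest cache level whose capacity is at
 * least the assumed working set (source + destination), or 0 if the
 * working set is larger than the L3 cache. */
int smallest_cache_level(size_t buf_size)
{
    size_t working_set = 2 * buf_size; /* assumption: src + dst only */

    if (working_set <= L1D_BYTES) {
        return 1;
    }
    if (working_set <= L2_BYTES) {
        return 2;
    }
    if (working_set <= L3_BYTES) {
        return 3;
    }
    return 0;
}
```

Under this model the working set leaves L1d above 16 KB, leaves L2 above 128 KB, and leaves L3 above 4096 KB. The 128 KB boundary is close to where the question's range 1 ends and range 2 begins, but the L3 boundary of 4096 KB does not line up with the drop near 3000 KB, so the model should not be read as a full explanation of the graphs.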

Jasen