
Summary:

memcpy seems unable to transfer over 2GB/sec on my system in a real or test application. What can I do to get faster memory-to-memory copies?

Full details:

As part of a data capture application (using some specialized hardware), I need to copy about 3 GB/sec from temporary buffers into main memory. To acquire data, I provide the hardware driver with a series of buffers (2MB each). The hardware DMAs data to each buffer, and then notifies my program when each buffer is full. My program empties the buffer (memcpy to another, larger block of RAM), and reposts the processed buffer to the card to be filled again. I am having issues with memcpy moving the data fast enough. It seems that the memory-to-memory copy should be fast enough to support 3GB/sec on the hardware that I am running on. Lavalys EVEREST gives me a 9337MB/sec memory copy benchmark result, but I can't get anywhere near those speeds with memcpy, even in a simple test program.

I have isolated the performance issue by adding/removing the memcpy call inside the buffer processing code. Without the memcpy, I can run at the full data rate, about 3 GB/sec. With the memcpy enabled, I am limited to about 550 MB/sec (using the current compiler).

In order to benchmark memcpy on my system, I've written a separate test program that just calls memcpy on some blocks of data. (I've posted the code below) I've run this both in the compiler/IDE that I'm using (National Instruments CVI) as well as Visual Studio 2010. While I'm not currently using Visual Studio, I am willing to make the switch if it will yield the necessary performance. However, before blindly moving over, I wanted to make sure that it would solve my memcpy performance problems.

Visual C++ 2010: 1900 MB/sec

NI CVI 2009: 550 MB/sec

I'm not surprised that CVI is significantly slower than Visual Studio, but I am surprised that the memcpy performance is this low in both. I'm not sure the numbers are directly comparable, but this is far below the EVEREST benchmark bandwidth. I don't need quite that level of performance, but a minimum of 3 GB/sec is necessary. Surely the standard library implementation can't be this much worse than whatever EVEREST is using!

What, if anything, can I do to make memcpy faster in this situation?


Hardware details: AMD Magny Cours (4x octal-core), 128 GB DDR3, Windows Server 2003 Enterprise X64

Test program:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>   // malloc, free, rand

const size_t NUM_ELEMENTS = 2*1024 * 1024;
const size_t ITERATIONS = 10000;

int main (int argc, char *argv[])
{
    LARGE_INTEGER start, stop, frequency;

    QueryPerformanceFrequency(&frequency);

    unsigned short * src = (unsigned short *) malloc(sizeof(unsigned short) * NUM_ELEMENTS);
    unsigned short * dest = (unsigned short *) malloc(sizeof(unsigned short) * NUM_ELEMENTS);

    for(int ctr = 0; ctr < NUM_ELEMENTS; ctr++)
    {
        src[ctr] = rand();
    }

    QueryPerformanceCounter(&start);

    for(int iter = 0; iter < ITERATIONS; iter++)
        memcpy(dest, src, NUM_ELEMENTS * sizeof(unsigned short));

    QueryPerformanceCounter(&stop);

    __int64 duration = stop.QuadPart - start.QuadPart;

    double duration_d = (double)duration / (double) frequency.QuadPart;

    // NUM_ELEMENTS/1024/1024 * sizeof(unsigned short) is the block size in MB (4 MB here)
    double mb_sec = (ITERATIONS * (NUM_ELEMENTS / 1024 / 1024) * sizeof(unsigned short)) / duration_d;

    printf("Duration: %.5lfs for %d iterations, %.3lfMB/sec\n", duration_d, (int)ITERATIONS, mb_sec);

    free(src);
    free(dest);

    getchar();

    return 0;
}

EDIT: If you have an extra five minutes and want to contribute, can you run the above code on your machine and post your time as a comment?

leecbaker
  • My notebook shows the same memory bandwidth. But a quickly engineered sse2/4 algorithm did not improve the performance (only marginally). – Christopher Nov 23 '10 at 22:09
  • More testing with SSE code only led to a speed-up of 60 MB/sec over the memcpy algorithm in VC2010. The Core i5 laptop peaked at about 2.224 GB/sec (shouldn't this number be doubled? We are writing this data and reading it at the same time, so ~4.4 GB/sec ...). Either something can be done that I overlooked, or you really have to 'not-copy' your data. – Christopher Nov 23 '10 at 23:20
  • Check out onemasse's answer (William Chan's SSE2 ASM implementation of memcpy) - using memcpy and CopyMemory, I get 1.8GB/s. With William's implementation, I got 3.54GB/s (that's nearly double!). This is on Core2Duo wolfdale with 2 channel DDR2 at 800MHz. – Zach Saw Nov 24 '10 at 12:01
  • Further to my answer below, it has just occurred to me that the transfer of data from the capture card will consume some of the memory bandwidth available to the CPU, I think you'd lose about 33% (memcpy = read/write, with capture card = write/read/write), so your in-app memcpy will be slower than a benchmark memcpy. – Skizz Dec 08 '10 at 00:54
  • Macbook Retina Pro Core, i7 2.6GHz (Win 7 x64 via Bootcamp) : 8474 MB/Sec. Compiler is Embarcadero C++Builder 2010 – Roddy May 21 '13 at 09:39
  • Interesting topic. Core i7-4770 3.4GHz, 2x8GB DDR3 1600MHz CL9. VS 2013 x64 rls build. 5 measurements. Win8: ~13GB/s, Win7: ~11GB/s – waldez Oct 09 '14 at 14:15

8 Answers

36

I have found a way to increase speed in this situation. I wrote a multi-threaded version of memcpy, splitting the area to be copied between threads. Here are some performance scaling numbers for a set block size, using the same timing code as found above. I had no idea that the performance, especially for this small size of block, would scale to this many threads. I suspect that this has something to do with the large number of memory controllers (16) on this machine.

Performance (10000x 4MB block memcpy):

 1 thread :  1826 MB/sec
 2 threads:  3118 MB/sec
 3 threads:  4121 MB/sec
 4 threads: 10020 MB/sec
 5 threads: 12848 MB/sec
 6 threads: 14340 MB/sec
 8 threads: 17892 MB/sec
10 threads: 21781 MB/sec
12 threads: 25721 MB/sec
14 threads: 25318 MB/sec
16 threads: 19965 MB/sec
24 threads: 13158 MB/sec
32 threads: 12497 MB/sec

I don't understand the huge performance jump between 3 and 4 threads. What would cause a jump like this?

I've included the memcpy code that I wrote below for others who may run into this same issue. Please note that there is no error checking in this code; you may need to add it for your application.

#define NUM_CPY_THREADS 4

HANDLE hCopyThreads[NUM_CPY_THREADS] = {0};
HANDLE hCopyStartSemaphores[NUM_CPY_THREADS] = {0};
HANDLE hCopyStopSemaphores[NUM_CPY_THREADS] = {0};
typedef struct
{
    int ct;
    void * src, * dest;
    size_t size;
} mt_cpy_t;

mt_cpy_t mtParamters[NUM_CPY_THREADS] = {0};

DWORD WINAPI thread_copy_proc(LPVOID param)
{
    mt_cpy_t * p = (mt_cpy_t * ) param;

    while(1)
    {
        WaitForSingleObject(hCopyStartSemaphores[p->ct], INFINITE);
        memcpy(p->dest, p->src, p->size);
        ReleaseSemaphore(hCopyStopSemaphores[p->ct], 1, NULL);
    }

    return 0;
}

int startCopyThreads()
{
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        hCopyStartSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
        hCopyStopSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
        mtParamters[ctr].ct = ctr;
        hCopyThreads[ctr] = CreateThread(0, 0, thread_copy_proc, &mtParamters[ctr], 0, NULL); 
    }

    return 0;
}

void * mt_memcpy(void * dest, void * src, size_t bytes)
{
    //set up parameters
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        mtParamters[ctr].dest = (char *) dest + ctr * bytes / NUM_CPY_THREADS;
        mtParamters[ctr].src = (char *) src + ctr * bytes / NUM_CPY_THREADS;
        mtParamters[ctr].size = (ctr + 1) * bytes / NUM_CPY_THREADS - ctr * bytes / NUM_CPY_THREADS;
    }

    //release semaphores to start computation
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        ReleaseSemaphore(hCopyStartSemaphores[ctr], 1, NULL);

    //wait for all threads to finish
    WaitForMultipleObjects(NUM_CPY_THREADS, hCopyStopSemaphores, TRUE, INFINITE);

    return dest;
}

int stopCopyThreads()
{
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        TerminateThread(hCopyThreads[ctr], 0);
        CloseHandle(hCopyStartSemaphores[ctr]);
        CloseHandle(hCopyStopSemaphores[ctr]);
    }
    return 0;
}
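For completeness, here is a minimal usage sketch of the helpers above (this assumes the definitions above plus <windows.h> and <stdlib.h>; the block size and iteration count are just illustrative):

int main(void)
{
    const size_t BLOCK_BYTES = 4 * 1024 * 1024;    // 4MB block, matching the benchmark above
    char * src  = (char *) malloc(BLOCK_BYTES);
    char * dest = (char *) malloc(BLOCK_BYTES);

    startCopyThreads();                            // spin up the worker pool once

    for (int i = 0; i < 10000; i++)
        mt_memcpy(dest, src, BLOCK_BYTES);         // each call fans the copy out to the workers

    stopCopyThreads();                             // tear the pool down when capture is done

    free(src);
    free(dest);
    return 0;
}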
leecbaker
  • Quite an old thread but I thought I'd add something: cache line coherence. Look it up. Probably explains the massive jump. Just by chance, of course. Knowing about this (Sutter writes about it), you can make an intelligent memcpy that makes use of it for near perfect scaling. – Robinson Jul 25 '13 at 16:11
  • @Robinson: definitely a good thing to look at. In the past few years, I think I have concluded that this ended up being a NUMA performance issue. – leecbaker Jul 25 '13 at 16:22
  • FWIW, I tried your code on my i5-2430M laptop. The number of threads makes little difference. 1, 2, 4 and 8 threads are basically the same speed. The fastest memcpy I have found was in hapalibashi's answer to this question: http://stackoverflow.com/questions/1715224/very-fast-memcpy-for-image-processing. – Andrew Bainbridge Aug 15 '13 at 11:58
  • @leecbaker, The huge jump in performance at 4+ threads is from cache. When 1, 2 or 3 cores are running your copy, the other cores are running something else or idling. The cache is almost never distributed dynamically, so the entire CPU cache is not used for caching your reads and stores, which is the case when you spawn 4+ threads. Also, your code is definitely wrong; just look at the code for calculating the copy size for each thread. – sgupta Dec 29 '13 at 04:21
9

I'm not sure whether it's handled at run time or whether you have to enable it at compile time, but you should have SSE or similar extensions enabled, as the vector unit can often write 128 bits to memory at a time, compared to 64 bits for an ordinary CPU register.

Try this implementation.

Yeah, and make sure that both the source and destination are aligned to 128 bits. If your source and destination are not aligned relative to each other, your memcpy() will have to do some serious magic. :)
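
If you'd rather stay in C than drop to assembly, here is a rough sketch of what an aligned SSE2 copy looks like with compiler intrinsics (it assumes both pointers are 16-byte aligned and the size is a multiple of 64 bytes; this is not the implementation from the link above):

#include <emmintrin.h>   // SSE2 intrinsics
#include <stddef.h>

// Sketch only: copy 64 bytes per iteration with aligned 128-bit loads/stores.
void sse2_copy(void * dest, const void * src, size_t bytes)
{
    __m128i * d = (__m128i *) dest;
    const __m128i * s = (const __m128i *) src;

    for (size_t i = 0; i < bytes / 64; i++)
    {
        __m128i a = _mm_load_si128(s + 0);
        __m128i b = _mm_load_si128(s + 1);
        __m128i c = _mm_load_si128(s + 2);
        __m128i e = _mm_load_si128(s + 3);
        _mm_store_si128(d + 0, a);
        _mm_store_si128(d + 1, b);
        _mm_store_si128(d + 2, c);
        _mm_store_si128(d + 3, e);
        s += 4;
        d += 4;
    }
}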

onemasse
  • You'll need to align /both/ source and dest to 16-byte (not 32-bits). William Chan's code is using movdqa (a for aligned). See http://siyobik.info/index.php?module=x86&id=183. You should also allocate cache-aligned memory for that last drop of performance. – Zach Saw Nov 24 '10 at 12:20
  • Yes, I said "at least". But of course it makes sense to align the data to 128 bits if you want to do vector-based I/O. I've corrected my answer. – onemasse Nov 24 '10 at 14:45
  • Ahh. I thought you meant the implementation you posted in the link. – Zach Saw Nov 25 '10 at 00:06
5

One thing to be aware of is that your process (and hence the performance of memcpy()) is impacted by the OS scheduling of tasks - it's hard to say how much of a factor this is in your timings, but it is difficult to control. The device DMA operation isn't subject to this, since it isn't running on the CPU once it's kicked off. Since your application is an actual real-time application though, you might want to experiment with Windows' process/thread priority settings if you haven't already. Just keep in mind that you have to be careful about this because it can have a really negative impact on other processes (and the user experience on the machine).
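
For example, a minimal sketch of bumping the priority of the process and of the thread doing the copies (use with care, for the reasons above):

#include <windows.h>

// Sketch: raise scheduling priority so the copy loop is preempted less often.
// HIGH_PRIORITY_CLASS is usually safer than REALTIME_PRIORITY_CLASS on a shared machine.
void boost_priority(void)
{
    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
}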

Another thing to keep in mind is that the OS memory virtualization might have an impact here - if the memory pages you're copying to aren't actually backed by physical RAM pages, the memcpy() operation will fault to the OS to get that physical backing in place. Your DMA pages are likely to be locked into physical memory (since they have to be for the DMA operation), so the source memory to memcpy() is likely not an issue in this regard. You might consider using the Win32 VirtualAlloc() API to ensure that your destination memory for the memcpy() is committed (I think VirtualAlloc() is the right API for this, but there might be a better one that I'm forgetting - it's been a while since I've had a need to do anything like this).
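
A rough sketch of that idea - commit the destination up front and (optionally) lock it so the first-touch page faults don't happen inside the timed copy (VirtualLock may require raising the process working set size for buffers this large):

#include <windows.h>

// Sketch: allocate the destination as committed pages and try to pin them in
// physical memory so memcpy() doesn't take soft page faults on first touch.
void * alloc_committed(size_t bytes)
{
    void * p = VirtualAlloc(NULL, bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (p != NULL)
        VirtualLock(p, bytes);   // may fail for very large regions unless the working set is increased
    return p;
}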

Finally, see if you can use the technique explained by Skizz to avoid the memcpy() altogether - that's your best bet if resources permit.

Michael Burr
4

You have a few barriers to obtaining the required memory performance:

  1. Bandwidth - there is a limit to how quickly data can move from memory to the CPU and back again. According to this Wikipedia article, 266MHz DDR3 RAM has an upper limit of around 17GB/s. Now, with a memcpy you need to halve this to get your maximum transfer rate since the data is read and then written. From your benchmark results, it looks like you're not running the fastest possible RAM in your system. If you can afford it, upgrade the motherboard / RAM (and it won't be cheap, Overclockers in the UK currently have 3x4GB PC16000 at £400)

  2. The OS - Windows is a preemptive multitasking OS so every so often your process will be suspended to allow other processes to have a look in and do stuff. This will clobber your caches and stall your transfer. In the worst case your entire process could be paged out to disk!

  3. The CPU - the data being moved has a long way to go: RAM -> L2 Cache -> L1 Cache -> CPU -> L1 -> L2 -> RAM. There may even be an L3 cache. If you want to involve the CPU you really want to be loading L2 whilst copying L1. Unfortunately, modern CPUs can run through an L1 cache block quicker than the time taken to load the L1. The CPU has a memory controller that helps a lot in these cases where you're streaming data into the CPU sequentially, but you're still going to have problems.

Of course, the faster way to do something is to not do it. Can the captured data be written anywhere in RAM, or is the buffer used at a fixed location? If you can write it anywhere, then you don't need the memcpy at all. If it's fixed, could you process the data in place and use a double buffer type system? That is, start capturing data and when it's half full, start processing the first half of the data. When the buffer's full, start writing captured data to the start and process the second half. This requires that the algorithm can process the data faster than the capture card produces it. It also assumes that the data is discarded after processing. Effectively, this is a memcpy with a transformation as part of the copy process, so you've got:

load -> transform -> save
\--/                 \--/
 capture card        RAM
   buffer

instead of:

load -> save -> load -> transform -> save
\-----------/
memcpy from
capture card
buffer to RAM

Or get faster RAM!

EDIT: Another option is to process the data between the data source and the PC - could you put a DSP / FPGA in there at all? Custom hardware will always be faster than a general purpose CPU.

Another thought: It's been a while since I've done any high performance graphics stuff, but could you DMA the data into the graphics card and then DMA it out again? You could even take advantage of CUDA to do some of the processing. This would take the CPU out of the memory transfer loop altogether.

Skizz
  • Skizz, I am not doing any mathematical processing on the data as it comes in- only copying to a different buffer, so another DMA, or DSP/FPGA use won't help. The data does come in via a double buffer system- actually a queue of 4 or more buffers, and is copied to a static long buffer (10GB+). – leecbaker Nov 23 '10 at 23:04
  • As to the faster RAM: the system currently has 16 channels of PC3-10600, which is rated for a 10.7GB/s theoretical peak transfer rate (each channel). While I realize I can't even come close to this peak rating, I think I should still have some headroom in the hardware performance of the RAM. – leecbaker Nov 23 '10 at 23:07
  • @leecbaker: So what is happening to the data? – Skizz Nov 23 '10 at 23:09
  • The data is collected and stored in RAM, and after all data is collected, the whole lot is processed. The collection is the performance sensitive part that I am concerned with. – leecbaker Nov 24 '10 at 00:01
2

First of all, you need to check that memory is aligned on a 16-byte boundary, otherwise you get penalties. This is the most important thing.
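
As a sanity check on the alignment point, here is a sketch of how you might allocate and verify 16-byte-aligned buffers with the MSVC CRT (CVI may need a different allocator; _aligned_malloc is an assumption here, not something the question's toolchain necessarily provides):

#include <malloc.h>    // _aligned_malloc / _aligned_free (MSVC CRT)
#include <stdint.h>
#include <assert.h>

// Sketch: 16-byte-aligned source/destination buffers for the benchmark loop.
void aligned_buffers_example(size_t bytes)
{
    unsigned short * src  = (unsigned short *) _aligned_malloc(bytes, 16);
    unsigned short * dest = (unsigned short *) _aligned_malloc(bytes, 16);

    assert(((uintptr_t) src  % 16) == 0);
    assert(((uintptr_t) dest % 16) == 0);

    // ... run the memcpy benchmark here ...

    _aligned_free(src);
    _aligned_free(dest);
}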

If you don't need a standard-compliant solution, you could check if things improve by using some compiler-specific extension such as memcpy64 (check your compiler documentation for what's available). The fact is that memcpy must be able to deal with single-byte copies, but moving 4 or 8 bytes at a time is much faster if you don't have this restriction.

Again, is it an option for you to write inline assembly code?

Simone
  • Inline assembly is an option, but other commenters here have noted that it doesn't yield a significant improvement. Also, I have just verified that all memory blocks are 16-byte aligned. – leecbaker Nov 23 '10 at 22:55
  • Can you post here on SO what assembly your compiler produces? – Simone Nov 24 '10 at 07:01
2

Perhaps you can explain some more about how you're processing the larger memory area?

Would it be possible within your application to simply pass ownership of the buffer, rather than copy it? This would eliminate the problem altogether.

Or are you using memcpy for more than just copying? Perhaps you're using the larger area of memory to build a sequential stream of data from what you've captured? Especially if you're processing one character at a time, you may be able to meet it halfway. For example, it may be possible to adapt your processing code to accommodate a stream represented as ‘an array of buffers’, rather than ‘a continuous memory area’.
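
A very rough sketch of that ownership-passing idea, in the context of the capture loop described in the question (the buffer bookkeeping and the driver call are hypothetical placeholders, not a real API):

#include <stdlib.h>

#define BUF_SIZE (2 * 1024 * 1024)        // 2MB, matching the DMA buffer size in the question

void repost_buffer_to_card(void * buf);   // hypothetical driver call that hands the card a buffer

// Hypothetical bookkeeping: keep the filled buffers instead of copying them.
static void * completed[8192];            // pointers to filled 2MB buffers, in capture order
static int completed_count = 0;

// Called when the driver reports a buffer as full.
void on_buffer_full(void * full_buffer)
{
    completed[completed_count++] = full_buffer;   // take ownership - no memcpy
    repost_buffer_to_card(malloc(BUF_SIZE));      // give the card a fresh buffer (or pop one from a pool)
}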

Stéphan Kochen
  • During the data capture period, I'm not doing anything to the data in the storage buffer. It gets dumped to a file at a later period. – leecbaker Nov 23 '10 at 21:06
  • Is it possible to capture directly into the larger memory area? You can build up an array of buffer pointers in order, then write them out. (You *might* even be able to use `WriteFileGather` to get vectored IO, but it has some rather strict alignment requirements.) – Stéphan Kochen Nov 23 '10 at 21:21
2

You can write a better implementation of memcpy using SSE2 registers. The version in VC2010 does this already. So the question is more whether you are handing it aligned memory.

Maybe you can do better than the VC2010 version, but it does need some understanding of how to do it.

PS: You can pass the buffer to the user mode program in an inverted call, to prevent the copy altogether.

Christopher
1

One source I would recommend you read is MPlayer's fast_memcpy function. Also consider the expected usage patterns, and note that modern CPUs have special store instructions which let you inform the CPU whether or not you will need to read back the data you're writing. Using the instructions that indicate you won't be reading back the data (and thus it doesn't need to be cached) can be a huge win for large memcpy operations.
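
As a rough illustration of that last point (not the MPlayer code), here is a sketch of a copy loop using SSE2 non-temporal stores via intrinsics; it assumes 16-byte-aligned pointers and a size that is a multiple of 16 bytes:

#include <emmintrin.h>   // SSE2 intrinsics
#include <stddef.h>

// Sketch: _mm_stream_si128 (movntdq) bypasses the cache on the store side,
// which helps when the destination won't be read again soon.
void streaming_copy(void * dest, const void * src, size_t bytes)
{
    __m128i * d = (__m128i *) dest;
    const __m128i * s = (const __m128i *) src;

    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(d + i, _mm_load_si128(s + i));

    _mm_sfence();   // make the non-temporal stores globally visible before continuing
}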

R.. GitHub STOP HELPING ICE