Multithread Programming for memcpy

Question

I am working on optimizing the memcpy function, and I found this question: How to increase performance of memcpy.

Since I'm not familiar with multithreaded programming, I don't know how to integrate the code below into the original main function. How do I turn the code from the original question into a complete multithreaded memcpy project? Specifically, where in the original main function should the calls to startCopyThreads, stopCopyThreads, and mt_memcpy go?

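#include <windows.h>  // HANDLE, CreateThread, CreateSemaphore, WaitForSingleObject, ...
#include <string.h>   // memcpy

#define NUM_CPY_THREADS 4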

HANDLE hCopyThreads[NUM_CPY_THREADS] = {0};
HANDLE hCopyStartSemaphores[NUM_CPY_THREADS] = {0};
HANDLE hCopyStopSemaphores[NUM_CPY_THREADS] = {0};
typedef struct
{
    int ct;
    void * src, * dest;
    size_t size;
} mt_cpy_t;

mt_cpy_t mtParamters[NUM_CPY_THREADS] = {0};

DWORD WINAPI thread_copy_proc(LPVOID param)
{
    mt_cpy_t * p = (mt_cpy_t * ) param;

    while(1)
    {
        WaitForSingleObject(hCopyStartSemaphores[p->ct], INFINITE);
        memcpy(p->dest, p->src, p->size);
        ReleaseSemaphore(hCopyStopSemaphores[p->ct], 1, NULL);
    }

    return 0;
}

int startCopyThreads()
{
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        hCopyStartSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
        hCopyStopSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
        mtParamters[ctr].ct = ctr;
        hCopyThreads[ctr] = CreateThread(0, 0, thread_copy_proc, &mtParamters[ctr], 0, NULL);
    }

    return 0;
}

void * mt_memcpy(void * dest, void * src, size_t bytes)
{
    //set up parameters
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        mtParamters[ctr].dest = (char *) dest + ctr * bytes / NUM_CPY_THREADS;
        mtParamters[ctr].src = (char *) src + ctr * bytes / NUM_CPY_THREADS;
        mtParamters[ctr].size = (ctr + 1) * bytes / NUM_CPY_THREADS - ctr * bytes / NUM_CPY_THREADS;
    }

    //release semaphores to start computation
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        ReleaseSemaphore(hCopyStartSemaphores[ctr], 1, NULL);

    //wait for all threads to finish
    WaitForMultipleObjects(NUM_CPY_THREADS, hCopyStopSemaphores, TRUE, INFINITE);

    return dest;
}

int stopCopyThreads()
{
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        TerminateThread(hCopyThreads[ctr], 0);
        CloseHandle(hCopyThreads[ctr]);  // release the thread handle as well
        CloseHandle(hCopyStartSemaphores[ctr]);
        CloseHandle(hCopyStopSemaphores[ctr]);
    }
    return 0;
}

Where are you going with this? Are you hoping to improve the performance of `memcpy()` by using multiple threads? – NPE, Mar 21 '13 at 07:05
Basically, you would call mt_memcpy from your main function. For what it's worth, this is unlikely to actually increase the speed of memcpy. The overhead of the semaphores and threads far exceeds the cost of most memcpy calls. But you should measure it before you use it in your code. – Missaka Wijekoon, Mar 21 '13 at 07:15
The thread that you are referring to dates from 2010, which is very old :) Don't expect to make these things faster without deep knowledge of your architecture (OS and processor). Usually this is already well optimized on modern systems and hard to beat. In any case, this is dominated by your memory throughput and not by processing time. – Jens Gustedt, Mar 21 '13 at 07:18
@Missaka Wijekoon How could the author of that question get results like: 1 thread: 1826 MB/sec, 2 threads: 3118 MB/sec, 3 threads: 4121 MB/sec? Do you mean I need to replace the memcpy call in the original main function with mt_memcpy? And where should the calls to startCopyThreads and stopCopyThreads go? I just need a complete project that reports results for different numbers of threads. Thanks – zjluoxiao, Mar 21 '13 at 07:26
@Jens Gustedt But how could the author get results like: 1 thread: 1826 MB/sec, 2 threads: 3118 MB/sec, 3 threads: 4121 MB/sec? I just need to know how to modify the original code to get this kind of test result. – zjluoxiao, Mar 21 '13 at 07:54
The other question has a **very** specific hardware configuration: *Hardware details: AMD Magny Cours- 4x octal core 128 GB DDR3*, which likely affects the result. Running a thread on a separate core will get you a separate memory controller **on that hardware**, but not on our ordinary desktops. – Bo Persson, Mar 21 '13 at 08:04
On most architectures, a single-threaded `memcpy` is already limited by pure memory bandwidth, not execution speed. Maybe there are some exotic or mystical builds where many threads will perform better in some situations, but there exist many _real_ architectures where it will perform _significantly_ worse (NUMA). Just don't do it. – Damon, Mar 21 '13 at 12:44
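Since several comments above recommend measuring before deciding, here is a minimal sketch of how a MB/sec figure like the ones quoted could be obtained. The names `copy_fn`, `now_seconds`, and `measure_copy_mb_per_sec` are hypothetical and do not come from the linked question, and the sketch assumes a C11 library that provides `timespec_get` (wall-clock time, not a high-resolution monotonic timer).

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical typedef: any function with a memcpy-like signature. */
typedef void *(*copy_fn)(void *dest, const void *src, size_t bytes);

/* Wall-clock time in seconds, using C11 timespec_get. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Copies `bytes` bytes `reps` times with `fn` and returns the throughput
   in MB/sec (1 MB = 1024 * 1024 bytes). Returns a negative value if
   allocation fails or the copied data does not match the source. */
double measure_copy_mb_per_sec(copy_fn fn, size_t bytes, int reps)
{
    char *src = malloc(bytes);
    char *dest = malloc(bytes);
    if (!src || !dest) {
        free(src);
        free(dest);
        return -1.0;
    }

    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 0xA5, bytes);
    memset(dest, 0, bytes);

    double start = now_seconds();
    for (int i = 0; i < reps; i++)
        fn(dest, src, bytes);
    double elapsed = now_seconds() - start;

    /* Verify the result; this also keeps the copy observable. */
    int ok = memcmp(dest, src, bytes) == 0;
    free(src);
    free(dest);

    if (!ok)
        return -1.0;
    if (elapsed <= 0.0)
        return 0.0;  /* timer too coarse for this workload */
    return ((double)bytes * reps / (1024.0 * 1024.0)) / elapsed;
}
```

A call such as `measure_copy_mb_per_sec(memcpy, (size_t)256 * 1024 * 1024, 10)` gives the single-threaded baseline. To time `mt_memcpy` from the question the same way, a small wrapper would be needed because its `src` parameter is not `const`. The buffer should be much larger than the CPU caches, otherwise the number reflects cache speed rather than memory bandwidth.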

score 0 · Answer 1 · answered Mar 21 '13 at 11:05

It depends a lot on the architecture and OS.

With one processor:

If you use threads for memcpy on a machine with only one core, you will not see a speedup. With all threads running on one processor, context switching adds overhead compared with a plain memcpy that uses no threads.

With multicore:

In this case it also depends on the kernel and whether it schedules your threads on different processors, since these threads will be user level. If your threads run concurrently on different processors, you may see a speedup if the memory has dual-port access. With single-port access I am not sure whether there will be any improvement.
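To make the multicore case concrete, here is a minimal sketch of a chunked, threaded copy. It is not the code from the question: it assumes POSIX threads instead of the Win32 API so that it is self-contained, and it creates and joins the threads on every call rather than keeping persistent workers woken by semaphores. The names `copy_chunk_t`, `copy_chunk`, and `threaded_memcpy` are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define NUM_CPY_THREADS 4

/* Hypothetical per-thread work description: one contiguous chunk. */
typedef struct {
    void *dest;
    const void *src;
    size_t size;
} copy_chunk_t;

/* Thread entry point: copy one chunk with the ordinary memcpy. */
static void *copy_chunk(void *param)
{
    copy_chunk_t *c = param;
    memcpy(c->dest, c->src, c->size);
    return NULL;
}

/* Splits the copy into NUM_CPY_THREADS chunks, copies each chunk in its
   own thread, and waits for all of them before returning dest. If a thread
   cannot be created, that chunk is copied in the calling thread instead.
   Like the question's code, i * bytes can overflow for extremely large sizes. */
void *threaded_memcpy(void *dest, const void *src, size_t bytes)
{
    pthread_t threads[NUM_CPY_THREADS];
    copy_chunk_t chunks[NUM_CPY_THREADS];
    int created[NUM_CPY_THREADS];

    for (int i = 0; i < NUM_CPY_THREADS; i++) {
        /* Same partitioning idea as the question: chunk i covers [begin, end). */
        size_t begin = (size_t)i * bytes / NUM_CPY_THREADS;
        size_t end = (size_t)(i + 1) * bytes / NUM_CPY_THREADS;

        chunks[i].dest = (char *)dest + begin;
        chunks[i].src = (const char *)src + begin;
        chunks[i].size = end - begin;

        created[i] = pthread_create(&threads[i], NULL, copy_chunk, &chunks[i]) == 0;
        if (!created[i])
            copy_chunk(&chunks[i]);  /* fallback: do this chunk ourselves */
    }

    /* Wait until every worker thread has finished its chunk. */
    for (int i = 0; i < NUM_CPY_THREADS; i++)
        if (created[i])
            pthread_join(threads[i], NULL);

    return dest;
}
```

Creating threads on each call is simpler but adds overhead per copy, which is why the question's version starts its workers once. With that persistent design, the call order in `main` would be `startCopyThreads()` once near the beginning, `mt_memcpy(...)` wherever `memcpy` was called, and `stopCopyThreads()` once before returning. Whether either variant is faster than plain `memcpy` depends on the hardware and has to be measured.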

answered Mar 21 '13 at 11:05

Plasma


However, if your processor has a SIMD unit, using its instructions would definitely help, and you don't need to use threads. – Plasma Mar 21 '13 at 11:08
Do you know how to modify the code in the link I mentioned above to get results like: 1 thread: 1826 MB/sec, 2 threads: 3118 MB/sec, 3 threads: 4121 MB/sec? Even if I cannot get the same numbers, I could then test whether this multithreaded method works on my i7 CPU. – zjluoxiao Mar 21 '13 at 14:50
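As an illustration of the SIMD suggestion in the comment above, here is a minimal sketch of a copy loop using SSE2 intrinsics. It assumes an x86 or x86-64 compiler that provides `<emmintrin.h>`; the function name `sse2_copy` is hypothetical. Library implementations of `memcpy` typically already use SIMD instructions internally, so this is meant to show the technique rather than to promise a speedup.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_loadu_si128, _mm_storeu_si128 */
#include <stddef.h>
#include <string.h>

/* Copies bytes from src to dest 16 bytes at a time with unaligned SSE2
   loads and stores, then finishes the remaining tail with memcpy.
   The buffers must not overlap. */
void *sse2_copy(void *dest, const void *src, size_t bytes)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    size_t i = 0;

    for (; i + 16 <= bytes; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
        _mm_storeu_si128((__m128i *)(d + i), v);
    }

    /* Fewer than 16 bytes remain (possibly zero). */
    memcpy(d + i, s + i, bytes - i);
    return dest;
}
```

The unaligned load and store variants are used so that the function works for any pointer values. Aligned variants exist but require 16-byte aligned addresses.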
