
Consider the following function for duplicating lines in an image:

void DuplicateRows(char* image_in, char* image_out, int width, int height)
{
    for(int row = 0; row < height; row++)
    {
         memcpy(image_out + (2 * row)*width, image_in + row*width, width);
         memcpy(image_out + (2 * row + 1)*width, image_in + row*width, width);
    }
}

When I split the image into several slices and assign each slice to a separate thread (say, rows 0-539 to thread 1 and rows 540-1079 to thread 2), the running time gets worse as the number of threads grows. Is there an explanation for this? (I suspect the bottleneck is memory access, which is serialized.)

In more detail:

Here is the test I ran (it has only one memcpy per row instead of two, but that does not matter; the example above was just to show the use case):

#include <vector>
#include <thread>
#include <functional>
#include <condition_variable>
#include <mutex>
#include <iostream>
#include <chrono>

using namespace std;

const int height = 1080;
const int width = 3840;

condition_variable cv;
mutex mu;
int finished;
void execute(vector<unsigned char>& vec_in, vector<unsigned char>& vec_out, int factor)
{
    auto src_row_ptr = &vec_in[0];
    auto dst_row_ptr = &vec_out[0];

    for(int i = 0; i<height/factor; i++)
    {
        memcpy(dst_row_ptr, src_row_ptr, width);

        src_row_ptr+= width;
        dst_row_ptr+= width;
    }

    unique_lock<mutex> lock(mu);

    finished++;

    lock.unlock();
    cv.notify_one();
}   


void check1thread()
{
    using namespace std::chrono;
    finished =0;
    cout<<"Checking 1 thread ... \n";
    vector<unsigned char> vec1(height * width, 1);
    vector<unsigned char> vec1_res(height * width ,0);

    auto tm0 = high_resolution_clock::now();
    auto src_row_ptr = &vec1[0];
    auto dst_row_ptr = &vec1_res[0];

    for(int i = 0; i<height; i++)
    {
        memcpy(dst_row_ptr, src_row_ptr, width);

        src_row_ptr+= width;
        dst_row_ptr+= width;
    }

    auto tm1 = high_resolution_clock::now();
    cout<<"work done\n";

    cout<<duration_cast<microseconds>(tm1-tm0).count() << " microseconds passed \n";

    cin.get();

}

void check2threads()
{
    using namespace std::chrono;
    finished =0;
    cout<<"Checking 2 threads ... \n";
    vector<unsigned char> vec1(height/2 * width, 1);
    vector<unsigned char> vec1_res(height/2 * width ,0);

    vector<unsigned char> vec2(height/2 * width, 1);
    vector<unsigned char> vec2_res(height/2 * width, 0);

    auto tm0 = high_resolution_clock::now();

    thread t1(execute, std::ref(vec1), std::ref(vec1_res) ,2 );
    thread t2(execute, std::ref(vec2), std::ref(vec2_res) ,2 );

    unique_lock<mutex> ul(mu);
    cv.wait(ul, [](){return finished == 2;} );

    auto tm1 = high_resolution_clock::now();
    cout<<"work done\n";

    cout<<duration_cast<microseconds>(tm1-tm0).count() << " microseconds passed \n";


    t1.join();
    t2.join();
}


int main()
{
    check1thread();
    check2threads();
    cin.get();
}
Ivan Lebediev
    Please create a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) and show us. It's possible that you actually do the processing serially, but it's impossible to say without seeing more code. – Some programmer dude Oct 30 '14 at 09:24
  • You may find some useful info in [this SO thread](http://stackoverflow.com/q/15541231/3242721). – Michal Hosala Oct 30 '14 at 09:26
  • Looks like the task for OpenMP or other parallel library. You can add OpenMP support here by adding one #pragma line. Then test performance in Release configuration. Of course, images must be large enough to have any performance boost from parallelizing. – Alex F Oct 30 '14 at 09:26
  • 2
    A comment in [this thread](http://stackoverflow.com/questions/15145152/is-memcpy-process-safe) suggests that Solaris `memcpy()` uses a mutex to make it thread-safe, that could slow things down. – Barmar Oct 30 '14 at 09:29
  • I would imagine that the performance of the CPU's data cache is suffering when you switch to another thread that's operating on a different memory block. – Andy Brown Oct 30 '14 at 09:34
  • I would bet that it has to do with processor cache locality. Try doing one thread the top half and another the bottom half, instead of alternated slices. – rodrigo Oct 30 '14 at 09:34
  • 2
    You are copying the lines interleaved in chunks of 1. Probably cache line overlap (false sharing). Maybe the CPU prefetches far into the next row. Rows are too small of a unit. Distribute rows in chunks as big as possible. – usr Oct 30 '14 at 09:39
  • 1
    You still don't show the complete code. What does `execute` do? Does it do anything with the condition variable or mutex (which I don't see a use for at the moment, just start both threads, then immediately join them, when both have joined you take your ending timestamp). – Some programmer dude Oct 30 '14 at 09:53
  • Sorry, Of course execute is needed, I missed it – Ivan Lebediev Oct 30 '14 at 09:58
  • 3840*540 is a multiple of 1024 (2025*1024), so it is a large multiple of a power of two ---> thrashing cache due to associativity. – Damon Oct 30 '14 at 13:07

0 Answers