Code runs 6 times slower with 2 threads than with 1

Question

Original Problem:

So I have written some code to experiment with threads and do some testing.

The code should create some numbers and then find the mean of those numbers.

I think it is just easier to show you what I have so far. I was expecting with two threads that the code would run about 2 times as fast. Measuring it with a stopwatch I think it runs about 6 times slower! EDIT: Now using the computer and clock() function to tell the time.

void findmean(std::vector<double>*, std::size_t, std::size_t, double*);


int main(int argn, char** argv)
{

    // Program entry point
    std::cout << "Generating data..." << std::endl;

    // Create a vector containing many variables
    std::vector<double> data;
    for(uint32_t i = 1; i <= 1024 * 1024 * 128; i ++) data.push_back(i);

    // Calculate mean using 1 core
    double mean = 0;
    std::cout << "Calculating mean, 1 Thread..." << std::endl;
    findmean(&data, 0, data.size(), &mean);
    mean /= (double)data.size();

    // Print result
    std::cout << "  Mean=" << mean << std::endl;

    // Repeat, using two threads
    std::vector<std::thread> thread;
    std::vector<double> result;
    result.push_back(0.0);
    result.push_back(0.0);
    std::cout << "Calculating mean, 2 Threads..." << std::endl;

    // Run threads
    uint32_t halfsize = data.size() / 2;
    uint32_t A = 0;
    uint32_t B, C, D;
    // Split the data into two blocks
    if(data.size() % 2 == 0)
    {
        B = C = D = halfsize;
    }
    else if(data.size() % 2 == 1)
    {
        B = C = halfsize;
        D = hsz + 1;
    }

    // Run with two threads
    thread.push_back(std::thread(findmean, &data, A, B, &(result[0])));
    thread.push_back(std::thread(findmean, &data, C, D , &(result[1])));

    // Join threads
    thread[0].join();
    thread[1].join();

    // Calculate result
    mean = result[0] + result[1];
    mean /= (double)data.size();

    // Print result
    std::cout << "  Mean=" << mean << std::endl;

    // Return
    return EXIT_SUCCESS;
}


void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
    for(uint32_t i = 0; i < length; i ++) {
        *result += (*datavec).at(start + i);
    }
}

I don't think this code is exactly wonderful, if you could suggest ways of improving it then I would be grateful for that also.

Register Variable:

Several people have suggested making a local variable for the function 'findmean'. This is what I have done:

void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
register double holding = *result;
for(uint32_t i = 0; i < length; i ++) {
    holding += (*datavec).at(start + i);
}
*result = holding;
}

I can now report: The code runs with almost the same execution time as with a single thread. That is a big improvement of 6x, but surely there must be a way to make it nearly twice as fast?

Register Variable and O2 Optimization:

I have set the optimization to 'O2' - I will create a table with the results.

Results so far:

Original Code with no optimization or register variable: 1 thread: 4.98 seconds, 2 threads: 29.59 seconds

Code with added register variable: 1 Thread: 4.76 seconds, 2 Threads: 4.76 seconds

With reg variable and -O2 optimization: 1 Thread: 0.43 seconds, 2 Threads: 0.6 seconds 2 Threads is now slower?

With Dameon's suggestion, which was to put a large block of memory in between the two result variables: 1 Thread: 0.42 seconds, 2 Threads: 0.64 seconds

With TAS 's suggestion of using iterators to access contents of the vector: 1 Thread: 0.38 seconds, 2 Threads: 0.56 seconds

Same as above on Core i7 920 (single channel memory 4GB): 1 Thread: 0.31 seconds, 2 Threads: 0.56 seconds

Same as above on Core i7 920 (dual channel memory 2x2GB): 1 Thread: 0.31 seconds, 2 Threads: 0.35 seconds

Since the threads access memory concurrently at exactly 128 MiB offset, this might be false sharing. — Damon, Jun 27 '13 at 16:20
That sounds a little slow. In experiments with multiple processors back ca 1982 the app only ran about 30% slower on 2 processors vs 1. — Hot Licks, Jun 27 '13 at 16:21
@Damon, Could you explain that a bit? Do you mean there is an overlap between summation at 128MB offset? What is false sharing? — FreelanceConsultant, Jun 27 '13 at 16:21
also threads are expensive to create and dispose, if you are not doing complex calculation they are a waste — LemonCool, Jun 27 '13 at 16:22
Actually, it's true sharing -- of the cache line. It's just not visible at the HLL programming level. — Hot Licks, Jun 27 '13 at 16:23
@Edward Bird: It basically means the threads compete for the same cache lines (this depends on associativity and cache size, it's a bit complicated... but I think there's a comprehensive formula at Agner Fog's site, IIRC). You might try dividing work differently, for example in somewhat smaller (say, 128k or 512k) blocks. If that "magically" solves the performance problem, you know. — Damon, Jun 27 '13 at 16:26
How many cores on your machine? Threads only give speedup if they let you put otherwise idle hardware to work. — Mike Dunlavey, Jun 27 '13 at 16:37
@MikeDunlavey This is running on an old Q6600 - 4 Cores, 4 Threads — FreelanceConsultant, Jun 27 '13 at 16:40
@Damon, So you are suggesting divide the work by running 2 threads which process perhaps 1/16 of the data at a time? — FreelanceConsultant, Jun 27 '13 at 16:41
@EdwardBird: That'd be a way to rule out (or prevent) false sharing, yes. — Damon, Jun 27 '13 at 17:00
@Damon Could you explain to me why that is likely to prevent false sharing, since result[0] and result[1] will still be adjacent? — FreelanceConsultant, Jun 27 '13 at 17:03
@EdwardBird: That's a good point, you will in addition want to insert some padding in between those two (64 bytes will do), as if these are in the same cache line they'll of course cause false sharing for every single write, too (if the compiler doesn't keep the value in a register). — Damon, Jun 27 '13 at 17:07
@Damon, Ah, right I think I get you... So something like: An array of return values, where I use only the first and last elements, which are spread quite far apart (in this case 64 bytes) ? — FreelanceConsultant, Jun 27 '13 at 17:10
@Damon, I tried that but it only made the execution (slightly, might not be significant) slower. — FreelanceConsultant, Jun 27 '13 at 17:13
@EdwardBird In any case, padding the 64 bytes, or using a temp variable with both get rid of the false sharing. Using a temp variable eliminates the memory accesses. Padding puts them on different cachelines. Your 2 thread slowdown is probably just due to overhead from having two threads or from the two threads contending for the same memory channel. But either way, it's normal to see slowdowns once you're used more threads than is optimal. "Optimal" here seems to be 1 thread. — Mysticial, Jun 27 '13 at 17:47
How this is a false sharing if he iterates over two halves of large vector in different threads? — SChepurin, Jun 27 '13 at 19:01
@SChepurin - Originally I was writing to a vector of length 2 (2 doubles) - see the original code above — FreelanceConsultant, Jun 27 '13 at 19:04
@TAS Haha, very funny - the point was for this to serve as a test for using two threads with vectors. The point being a lot of the stuff I write requires accessing all elements inside a vector and doing some arithmetic with them. — FreelanceConsultant, Jun 27 '13 at 19:08

Mysticial · Accepted Answer · 2013-06-27T16:48:49.467

20

Why are 2 threads 6x slower than 1 thread?

You are getting hit by a bad case of false sharing.

After getting rid of the false-sharing, why is 2 threads not faster than 1 thread?

You are bottlenecked by your memory bandwidth.

False Sharing:

The problem here is that each thread is accessing the result variable at adjacent memory locations. It's likely that they fall on the same cacheline so each time a thread accesses it, it will bounce the cacheline between the cores.

Each thread is running this loop:

for(uint32_t i = 0; i < length; i ++) {
    *result += (*datavec).at(start + i);
}

And you can see that the result variable is being accessed very often (each iteration). So each iteration, the threads are fighting for the same cacheline that's holding both values of result.

Normally, the compiler should put *result into a register thereby removing the constant access to that memory location. But since you never turned on optimizations, it's very likely the compiler is indeed still accessing the memory location and thus incurring false-sharing penalties at every iteration of the loop.

Memory Bandwidth:

Once you have eliminated the false sharing and got rid of the 6x slowdown, the reason why you're not getting improvement is because you've maxed out your memory bandwidth.

Sure your processor may be 4 cores, but they all share the same memory bandwidth. Your particular task of summing up an array does very little (computational) work for each memory access. A single thread is already enough to max out your memory bandwidth. Therefore going to more threads is not likely to get you much improvement.

In short, no you won't be able to make summing an array significantly faster by throwing more threads at it.

edited Jun 27 '13 at 16:48

answered Jun 27 '13 at 16:21

Mysticial

464,885
45
335
332

Should I create two independent result variables NOT inside a vector or will this not help? – FreelanceConsultant Jun 27 '13 at 16:23
@EdwardBird Try to declare a local variable to hold the sum and only write to it at the end of the loop. But I'm still unsure since the compiler should be keeping `*result` in a register anyway. – Mysticial Jun 27 '13 at 16:25
@EdwardBird Quick question, did you enable optimizations? – Mysticial Jun 27 '13 at 16:33
@Mystical, No I have no optimizations turned on. I am using g++-4.8 on Code::Blocks, overnight build from 3 days ago. – FreelanceConsultant Jun 27 '13 at 16:34
1

@EdwardBird no optimisations on => benchmark results useless. – R. Martinho Fernandes Jun 27 '13 at 16:38
@R.MartinhoFernandes What level of 'O' would you like? – FreelanceConsultant Jun 27 '13 at 16:39
@EdwardBird `O2` should be sufficient. – Mysticial Jun 27 '13 at 16:40
@Mysticial, What do the 'O' 's actually correspond to? – FreelanceConsultant Jun 27 '13 at 16:42
@EdwardBird The Os are optimization levels. – Mysticial Jun 27 '13 at 16:43
@Mysticial, Well obviously, but what does that mean? Does O2 do things that O1 doesn't, and if so what? – FreelanceConsultant Jun 27 '13 at 16:56
@EdwardBird Well yes. The higher the number, the more types of optimizations it will do. `O2` is the standard level. You can go up to `O3`, but that includes optimizations that may backfire, or may increase the size of the binary, etc... – Mysticial Jun 27 '13 at 16:59
@Mysticial: actually, the page I linked says that `-O3` is safe, but `-Ofast` contains the dangerous things. – Mooing Duck Jun 27 '13 at 17:40
@MooingDuck Oh. I didn't mean to imply that `O3` wasn't safe in that it could change functionality. I only meant that it does optimizations that might actually hurt performance (such as too much inlining). – Mysticial Jun 27 '13 at 17:44

score 2 · Answer 2 · answered Jun 27 '13 at 19:07

2

As stated in other answers, you are seeing false sharing on the result variable, but there is also one other location where this is happening. The std::vector<T>::at() function (as well as the std::vector<T>::operator[]()) access the length of the vector on each element access. To avoid this you should switch to using iterators. Also, using std::accumulate() will allow you to take advantage of optimizations in the standard library implementation you are using.

Here are the relevant parts of the code:

thread.push_back(std::thread(findmean, std::begin(data)+A, std::begin(data)+B, &(result[0])));
thread.push_back(std::thread(findmean, std::begin(data)+B, std::end(data), &(result[1])));

and

void findmean(std::vector<double>::const_iterator start, std::vector<double>::const_iterator end, double* result)
{
    *result = std::accumulate(start, end, 0.0);
}

This consistently gives me better performance for two threads on my 32-bit netbook.

answered Jun 27 '13 at 19:07

TAS

2,039
12
17

Read my comment in reply to yours. – FreelanceConsultant Jun 27 '13 at 19:09
1

False sharing actually does not occur when the variables are only being read. That's because the cacheline will just get duplicated onto the cores. It's only when there are writes that will cause the cacheline to ping-pong. – Mysticial Jun 27 '13 at 19:12
I have implemented the iterator solution, and the program is slightly faster. Thank you for your help. See the results section of my answer above. What you said about using accumulate is nonsense, however, since this is a **test case** simply for assessing the performance of threads with **regular accesses to memory over a large block of memory**. That is what the program is designed to assess. Hence I don't add numbers in a for loop. I know I could have done that. – FreelanceConsultant Jun 27 '13 at 19:20
@Mysticial Why does writing cause the problem? My understanding was a write could be done asynchronously, and the only waiting occurs when the cpu needs to access that section of memory again if a write is still pending? Clearly my understanding is not correct then? – FreelanceConsultant Jun 27 '13 at 19:22
Rated up - iterators were very useful – FreelanceConsultant Jun 27 '13 at 19:25
@EdwardBird When you write to a cacheline that is shared across multiple cores, that write will cause all other copies of the cacheline to be invalidated. Suppose a cacheline is in core A and B. A writes to it. This invalidates B's copy. So when B tries to access the cacheline, it must go grab it from A, otherwise, B will be reading an old copy of the data. If A and B are just reading the cacheline, nothing needs to be sent back and forth because the data isn't changing. – Mysticial Jun 27 '13 at 19:26
1

@EdwardBird Sorry, didn't see your answer to my comment before I posted the solution. Think I just missed it. – TAS Jun 27 '13 at 19:28
@Mysticial Ah yes, of course, thanks that helped clear it up for me. My understanding was that something else (I assume the memory controller or what used to be the Northbridge) took care of doing the write back to memory from the CPU's cache. Is that correct? (As far as I know, the Northbridge is now non existent on i7 processors. I am using a q6600 at the moment... putting it to good use... and I know it does have a NB. My understanding was the memory controller got moved to become part of the CPU to lower latency for memory read/writes.) – FreelanceConsultant Jun 27 '13 at 19:29
@EdwardBird Cache coherence protocols vary by system. Some use memory. Some will grab directly from other guy's cache. I don't believe Intel publishes what they do, but they all involve some form of invalidation based scheme. – Mysticial Jun 27 '13 at 19:30
Thanks for the recommendations, I got a significant speed up using std::accumulate() and iterators and -O3 optimization. – Plamen Dec 27 '16 at 20:21

score 1 · Answer 3 · answered Jun 27 '13 at 16:22

1

More threads doesn't mean faster! There is an overhead in creating and context-switching threads, even the hardware in which this code run is influencing the results. For such a trivial work like this it's better probably a single thread.

answered Jun 27 '13 at 16:22

Oscar

13,594
8
47
75

score 0 · Answer 4 · answered Jun 27 '13 at 16:21

0

This is probably because the cost of launching and waiting for two threads is a lot more than computing the result in a single loop. Your data size is 128MB, which is not alot for modern processors to process in a single loop.

answered Jun 27 '13 at 16:21

Kourosh

274
1
6

I would have thought a thread would take much less than a second to launch, is this not correct? – FreelanceConsultant Jun 27 '13 at 16:22
Probably, but I think the whole program should take less than a second. Also creating a thread and waiting for it to finish can result in a context switch, which is very expensive. – Kourosh Jun 27 '13 at 16:25
If you have enough RAM, try to make the array size in order of several giga bytes. – Kourosh Jun 27 '13 at 16:26
Okay, I will increase it to 2GB and report back. – FreelanceConsultant Jun 27 '13 at 16:29
I tried that, but the same result occurred. Since double takes up 64bits of memory space, I only doubled the number of items in the vector and reached 2GB of memory allocation. If I double it again, the computer starts using swap space, and then it laggggggssss – FreelanceConsultant Jun 27 '13 at 16:33
It might be false sharing as others said. btw how do you measure time? did you write your own functions? or use linux time program? – Kourosh Jun 27 '13 at 16:41
I was timing it with a stopwatch, but now I have included 'clock_t time = clock()' and all that, I will create a table of results. – FreelanceConsultant Jun 27 '13 at 16:45

Code runs 6 times slower with 2 threads than with 1

Original Problem:

Register Variable:

Register Variable and O2 Optimization:

Results so far:

4 Answers4

Linked