
Original Problem:

So I have written some code to experiment with threads and do some testing.

The code should create some numbers and then find the mean of those numbers.

I think it is just easier to show you what I have so far. I was expecting the code to run about twice as fast with two threads, but measuring it with a stopwatch it runs about 6 times slower! EDIT: Now using the clock() function instead of a stopwatch to measure the time.
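(A minimal sketch of such a timing harness, for reference; do_work() is a placeholder for the code being measured. Note that clock() reports CPU time, which on POSIX systems is summed across all threads, so a wall clock such as std::chrono::steady_clock is safer when comparing thread counts.)

#include <chrono>
#include <iostream>

// Wall-clock timing around an arbitrary piece of work.
auto t0 = std::chrono::steady_clock::now();
do_work(); // placeholder
auto t1 = std::chrono::steady_clock::now();
std::cout << std::chrono::duration<double>(t1 - t0).count() << " s" << std::endl;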

#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <vector>

void findmean(std::vector<double>*, std::size_t, std::size_t, double*);


int main(int argn, char** argv)
{

    // Program entry point
    std::cout << "Generating data..." << std::endl;

    // Create a vector containing many variables
    std::vector<double> data;
    for(uint32_t i = 1; i <= 1024 * 1024 * 128; i ++) data.push_back(i);

    // Calculate mean using 1 core
    double mean = 0;
    std::cout << "Calculating mean, 1 Thread..." << std::endl;
    findmean(&data, 0, data.size(), &mean);
    mean /= (double)data.size();

    // Print result
    std::cout << "  Mean=" << mean << std::endl;

    // Repeat, using two threads
    std::vector<std::thread> thread;
    std::vector<double> result;
    result.push_back(0.0);
    result.push_back(0.0);
    std::cout << "Calculating mean, 2 Threads..." << std::endl;

    // Run threads
    uint32_t halfsize = data.size() / 2;
    uint32_t A = 0;
    uint32_t B, C, D;
    // Split the data into two blocks
    if(data.size() % 2 == 0)
    {
        B = C = D = halfsize;
    }
    else // odd number of elements: the second block is one longer
    {
        B = C = halfsize;
        D = halfsize + 1;
    }

    // Run with two threads
    thread.push_back(std::thread(findmean, &data, A, B, &(result[0])));
    thread.push_back(std::thread(findmean, &data, C, D , &(result[1])));

    // Join threads
    thread[0].join();
    thread[1].join();

    // Calculate result
    mean = result[0] + result[1];
    mean /= (double)data.size();

    // Print result
    std::cout << "  Mean=" << mean << std::endl;

    // Return
    return EXIT_SUCCESS;
}


// Sum elements [start, start + length) of *datavec into *result.
void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
    for(std::size_t i = 0; i < length; i ++) {
        *result += (*datavec).at(start + i);
    }
}

I don't think this code is exactly wonderful; if you could suggest ways of improving it, I would be grateful for that also.

Register Variable:

Several people have suggested making a local variable for the function 'findmean'. This is what I have done:

void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
    // Accumulate into a local variable so the shared *result is written only
    // once. ('register' is only a hint to the compiler, and was deprecated in
    // C++11; a plain local variable has the same effect.)
    register double holding = *result;
    for(std::size_t i = 0; i < length; i ++) {
        holding += (*datavec).at(start + i);
    }
    *result = holding;
}

I can now report: the code runs in almost the same time with two threads as with one. That is a big improvement over the 6x slowdown, but surely there must be a way to make it nearly twice as fast?

Register Variable and O2 Optimization:

I have set the optimization level to -O2 - I will create a table with the results.

Results so far:

    Configuration                                           1 Thread   2 Threads
    Original code, no optimization, no register variable   4.98 s     29.59 s
    With register variable added                            4.76 s     4.76 s
    Register variable and -O2                               0.43 s     0.60 s
     + Damon's padding between the two result variables     0.42 s     0.64 s
     + TAS's iterators to access the vector contents        0.38 s     0.56 s
    Same as above on Core i7 920 (single-channel 4GB)       0.31 s     0.56 s
    Same as above on Core i7 920 (dual-channel 2x2GB)       0.31 s     0.35 s

(Note that with -O2, two threads became slower than one.)

  • Since the threads access memory concurrently at exactly 128 MiB offset, this might be false sharing. – Damon Jun 27 '13 at 16:20
  • That sounds a little slow. In experiments with multiple processors back ca 1982 the app only ran about 30% slower on 2 processors vs 1. – Hot Licks Jun 27 '13 at 16:21
  • @Damon, Could you explain that a bit? Do you mean there is an overlap between summation at 128MB offset? What is false sharing? – FreelanceConsultant Jun 27 '13 at 16:21
  • Also, threads are expensive to create and dispose of; if you are not doing complex calculations, they are a waste. – LemonCool Jun 27 '13 at 16:22
  • Actually, it's true sharing -- of the cache line. It's just not visible at the HLL programming level. – Hot Licks Jun 27 '13 at 16:23
  • @Edward Bird: It basically means the threads compete for the same cache lines (this depends on associativity and cache size; it's a bit complicated, but I think there's a comprehensive formula at Agner Fog's site, IIRC). You might try dividing work differently, for example in somewhat smaller (say, 128k or 512k) blocks. If that "magically" solves the performance problem, you know. – Damon Jun 27 '13 at 16:26
  • How many cores on your machine? Threads only give speedup if they let you put otherwise idle hardware to work. – Mike Dunlavey Jun 27 '13 at 16:37
  • @MikeDunlavey This is running on an old Q6600 - 4 Cores, 4 Threads – FreelanceConsultant Jun 27 '13 at 16:40
  • @Damon, So you are suggesting divide the work by running 2 threads which process perhaps 1/16 of the data at a time? – FreelanceConsultant Jun 27 '13 at 16:41
  • @EdwardBird: That'd be a way to rule out (or prevent) false sharing, yes. – Damon Jun 27 '13 at 17:00
  • @Damon Could you explain to me why that is likely to prevent false sharing, since result[0] and result[1] will still be adjacent? – FreelanceConsultant Jun 27 '13 at 17:03
  • @EdwardBird: That's a good point, you will in addition want to insert some padding in between those two (64 bytes will do), as if these are in the same cache line they'll of course cause false sharing for every single write, too (if the compiler doesn't keep the value in a register). – Damon Jun 27 '13 at 17:07
  • @Damon, Ah, right I think I get you... So something like: An array of return values, where I use only the first and last elements, which are spread quite far apart (in this case 64 bytes) ? – FreelanceConsultant Jun 27 '13 at 17:10
  • @Damon, I tried that but it only made the execution (slightly, might not be significant) slower. – FreelanceConsultant Jun 27 '13 at 17:13
  • @EdwardBird In any case, padding with 64 bytes or using a temp variable will both get rid of the false sharing. Using a temp variable eliminates the memory accesses. Padding puts them on different cachelines. Your 2-thread slowdown is probably just overhead from having two threads, or from the two threads contending for the same memory channel. Either way, it's normal to see slowdowns once you've used more threads than is optimal. "Optimal" here seems to be 1 thread. – Mysticial Jun 27 '13 at 17:47
  • Have you tried replacing the loop with `std::accumulate`? – TAS Jun 27 '13 at 18:15
  • How is this false sharing if he iterates over two halves of a large vector in different threads? – SChepurin Jun 27 '13 at 19:01
  • @SChepurin - Originally I was writing to a vector of length 2 (2 doubles) - see the original code above – FreelanceConsultant Jun 27 '13 at 19:04
  • @TAS Haha, very funny - the point was for this to serve as a test for using two threads with vectors. The point being a lot of the stuff I write requires accessing all elements inside a vector and doing some arithmetic with them. – FreelanceConsultant Jun 27 '13 at 19:08

4 Answers


Why are 2 threads 6x slower than 1 thread?

You are getting hit by a bad case of false sharing.

After getting rid of the false-sharing, why is 2 threads not faster than 1 thread?

You are bottlenecked by your memory bandwidth.


False Sharing:

The problem here is that each thread is accessing its result variable at adjacent memory locations. They likely fall on the same cacheline, so each time a thread writes to its result, it bounces the cacheline between the cores.

Each thread is running this loop:

for(uint32_t i = 0; i < length; i ++) {
    *result += (*datavec).at(start + i);
}

And you can see that the result variable is being accessed very often (each iteration). So each iteration, the threads are fighting for the same cacheline that's holding both values of result.

Normally, the compiler should put *result into a register thereby removing the constant access to that memory location. But since you never turned on optimizations, it's very likely the compiler is indeed still accessing the memory location and thus incurring false-sharing penalties at every iteration of the loop.
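A minimal sketch of one way to rule this out (the alignas value and the PaddedDouble name are assumptions for illustration, not from the original code): give each thread's accumulator its own cache line.

struct alignas(64) PaddedDouble // 64 bytes = common cache line size
{
    double value = 0.0;
};

PaddedDouble result[2]; // result[0] and result[1] are now 64 bytes apart,
                        // so writes from one thread no longer invalidate
                        // the cache line holding the other thread's value.

Even so, accumulating into a local variable (as in the question's edit) is better still, since it removes the shared write from the loop entirely.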

Memory Bandwidth:

Once you have eliminated the false sharing and gotten rid of the 6x slowdown, the reason you're not seeing any further improvement is that you've maxed out your memory bandwidth.

Your processor may have 4 cores, but they all share the same memory bandwidth. Your particular task of summing up an array does very little (computational) work per memory access. A single thread is already enough to max out your memory bandwidth, so going to more threads is not likely to get you much improvement.

In short, no you won't be able to make summing an array significantly faster by throwing more threads at it.
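As a rough sanity check (the bandwidth figure is an assumption, not a measurement): the data set is 128M doubles, i.e. 1 GiB. If the memory system sustains on the order of 3 GB/s, a straight read of 1 GiB takes roughly 1.07 / 3 ≈ 0.36 s, which matches the ~0.31-0.38 s measured in the question regardless of thread count.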

– Mysticial

As stated in other answers, you are seeing false sharing on the result variable, but there is one other place where overhead creeps in. The std::vector<T>::at() function checks the index against the vector's length on every element access, and going through a pointer to the vector makes it harder for the compiler to keep the internal bookkeeping in registers. To avoid this you should switch to using iterators. Also, using std::accumulate() will allow you to take advantage of optimizations in the standard library implementation you are using.

Here are the relevant parts of the code:

thread.push_back(std::thread(findmean, std::begin(data)+A, std::begin(data)+B, &(result[0])));
thread.push_back(std::thread(findmean, std::begin(data)+B, std::end(data), &(result[1])));

and

void findmean(std::vector<double>::const_iterator start, std::vector<double>::const_iterator end, double* result)
{
    *result = std::accumulate(start, end, 0.0);
}

This consistently gives me better performance for two threads on my 32-bit netbook.
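For the single-threaded measurement, the same function can be called directly over the whole vector (a usage sketch, reusing the mean variable from the question):

findmean(std::begin(data), std::end(data), &mean);
mean /= (double)data.size();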

– TAS
  • Read my comment in reply to yours. – FreelanceConsultant Jun 27 '13 at 19:09
  • False sharing actually does not occur when the variables are only being read. That's because the cacheline just gets duplicated onto the cores; it's only writes that cause the cacheline to ping-pong. – Mysticial Jun 27 '13 at 19:12
  • I have implemented the iterator solution, and the program is slightly faster. Thank you for your help. See the results section of my answer above. What you said about using accumulate is nonsense, however, since this is a **test case** simply for assessing the performance of threads with **regular accesses to memory over a large block of memory**. That is what the program is designed to assess. Hence I don't add numbers in a for loop. I know I could have done that. – FreelanceConsultant Jun 27 '13 at 19:20
  • @Mysticial Why does writing cause the problem? My understanding was a write could be done asynchronously, and the only waiting occurs when the cpu needs to access that section of memory again if a write is still pending? Clearly my understanding is not correct then? – FreelanceConsultant Jun 27 '13 at 19:22
  • Rated up - iterators were very useful – FreelanceConsultant Jun 27 '13 at 19:25
  • @EdwardBird When you write to a cacheline that is shared across multiple cores, that write will cause all other copies of the cacheline to be invalidated. Suppose a cacheline is in core A and B. A writes to it. This invalidates B's copy. So when B tries to access the cacheline, it must go grab it from A, otherwise, B will be reading an old copy of the data. If A and B are just reading the cacheline, nothing needs to be sent back and forth because the data isn't changing. – Mysticial Jun 27 '13 at 19:26
  • @EdwardBird Sorry, didn't see your answer to my comment before I posted the solution. Think I just missed it. – TAS Jun 27 '13 at 19:28
  • @Mysticial Ah yes, of course, thanks - that helped clear it up for me. My understanding was that something else (I assume the memory controller, or what used to be the Northbridge) took care of writing back to memory from the CPU's cache. Is that correct? (As far as I know, the Northbridge is now nonexistent on i7 processors. I am using a Q6600 at the moment... putting it to good use... and I know it does have a NB. My understanding was that the memory controller was moved into the CPU to lower latency for memory reads/writes.) – FreelanceConsultant Jun 27 '13 at 19:29
  • @EdwardBird Cache coherence protocols vary by system. Some use memory. Some will grab directly from other guy's cache. I don't believe Intel publishes what they do, but they all involve some form of invalidation based scheme. – Mysticial Jun 27 '13 at 19:30
  • Thanks for the recommendations, I got a significant speed up using std::accumulate() and iterators and -O3 optimization. – Plamen Dec 27 '16 at 20:21

More threads doesn't mean faster! There is overhead in creating and context-switching threads, and even the hardware this code runs on influences the results. For trivial work like this, a single thread is probably better.
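A quick way to gauge that overhead (a sketch, not part of the original answer) is to time nothing but thread creation and joining:

#include <chrono>
#include <iostream>
#include <thread>

int main()
{
    auto t0 = std::chrono::steady_clock::now();
    std::thread t([]{}); // a thread that does no work at all
    t.join();
    auto t1 = std::chrono::steady_clock::now();
    // Typically on the order of tens of microseconds - only significant
    // when the work given to each thread is comparably small.
    std::cout << std::chrono::duration<double>(t1 - t0).count() << " s" << std::endl;
}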

– Oscar

This is probably because the cost of launching and waiting for two threads is much greater than the cost of computing the result in a single loop. Your data set is 128M elements, which is not a lot for a modern processor to process in a single loop.

– Kourosh
  • I would have thought a thread would take much less than a second to launch, is this not correct? – FreelanceConsultant Jun 27 '13 at 16:22
  • Probably, but I think the whole program should take less than a second. Also creating a thread and waiting for it to finish can result in a context switch, which is very expensive. – Kourosh Jun 27 '13 at 16:25
  • If you have enough RAM, try to make the array size in order of several giga bytes. – Kourosh Jun 27 '13 at 16:26
  • Okay, I will increase it to 2GB and report back. – FreelanceConsultant Jun 27 '13 at 16:29
  • I tried that, but the same result occurred. Since a double takes up 64 bits of memory, I only doubled the number of items in the vector and reached 2GB of memory allocation. If I double it again, the computer starts using swap space, and then it lags badly. – FreelanceConsultant Jun 27 '13 at 16:33
  • It might be false sharing as others said. btw how do you measure time? did you write your own functions? or use linux time program? – Kourosh Jun 27 '13 at 16:41
  • I was timing it with a stopwatch, but now I have included 'clock_t time = clock()' and all that, I will create a table of results. – FreelanceConsultant Jun 27 '13 at 16:45