1

I am new to multithreaded programming in C++. I wrote a simple piece of code, which I will paste below. When run in two threads the code finishes barely faster than when run in single threaded. I have come across other similar questions, however they are different as I have no shared resources which both threads will need to access:

The code is below:

#include <iostream>
#include <thread>
#include <mutex>
#include <chrono>

using namespace std;

typedef unsigned long long ull;

ull oddSum = 0;
ull evenSum = 0;

void addOdd(){
    for(int i = 1; i <= 1999999999; i++){
        if(i % 2 == 1)
            oddSum += i;
    }
}

void addEven(){
    for(int i = 1; i <= 1999999999; i++){
        if(i % 2 == 0)
            evenSum += i;
    }
}

int main(){
    auto startTime = std::chrono::high_resolution_clock::now();

    //Two threads
    std::thread t1(addEven);    //launch the two threads to run
    std::thread t2(addOdd); 

    t1.join();
    t2.join();    //wait for both to finish

    //One thread
    //addEven();
    //addOdd();

    auto stopTime = std::chrono::high_resolution_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(stopTime - startTime);

    cout << "Odd Sum: " << oddSum << endl;
    cout << "Even Sum: " << evenSum << endl;

    cout << elapsed.count()/(double)1000000 << endl;

    return 0;
}

When I run with a single thread, average on 10 runs is around 7.3 seconds.

When I run with two threads, average on 10 runs is around 6.8 seconds.

Since the vast majority of the time is taken by the loops in the functions, I would believe that running two threads in parallel, each with a single function, would halve the time it takes to run.

Note 1: I know the time cannot possibly be properly halved, a more educated guess might be a run time of up to 5 secs in two threads. I understand the creation of the thread objects has its own overhead.

Note 2: Maybe I am missing something, but the threads are not accessing any shared location between the two.

Any idea is welcome. I am familiar with the theory behind concurrent programming, and I am now just starting to gain some hands-on experience. I have an intel i7 with 4 cores.

3 Answers3

5

Because you have two global variables declared right next to each other, you're probably running into false sharing.

Zuodian Hu
  • 979
  • 4
  • 9
5

When frequently accessing variables from different threads, you should ensure they are on different cache-lines (which are 64 bytes wide on x86).

You can do this by aligning the variables:

ull oddSum __attribute__((aligned(64))) = 0;
ull evenSum  __attribute__((aligned(64))) = 0;

Failure to do this effectively serializes the writes, as the cache line will only be modified by one CPU at a time.

Aligning the variables reduces the runtime by 30% for me in the mutli-thread case.

As @walnut mentioned in his comment, this can be done in a portable way if your compiler supports C++17:

#include <new>
alignas(std::hardware_destructive_interference_size) ull oddSum = 0;
alignas(std::hardware_destructive_interference_size) ull evenSum  = 0;
Mikel Rychliski
  • 3,455
  • 5
  • 22
  • 29
  • This solved my problem. Will definitely have to do more reading on this matter. Edit: found this post also, slightly more in depth explanation https://stackoverflow.com/questions/8469427/how-and-when-to-align-to-cache-line-size – small_cat_destroyer Apr 15 '20 at 00:11
  • 1
    We have `alignas` since C++11 and `std::hardware_destructive_interference_size` since C++17. You don't need to use compiler-specific attributes and magic constants. – walnut Apr 15 '20 at 00:20
  • What CPU and compiler are you using? Also what optimization level? – WBuck Apr 15 '20 at 03:08
  • @WBuck: GCC 4.8, no optimization (which I guess makes the measurements not very meaningful...). With optimization the problem doesn't occur because the compiler doesn't write the variable back to memory at each iteration. I'm guessing original poster didn't have optimization enabled based on their benchmark numbers (with optimization my 10+ year old Xeon 5150 is much faster then original numbers). – Mikel Rychliski Apr 15 '20 at 03:20
  • @WBuck I am using g++ version 5.1.0 and I have an Intel i7-8550U with 4 cores. I haven't passed an optimization flag (mostly since I do not know what it does *specifically*). As I mentioned, I am only trying to gain some practical experience (learn and understand why program behaves a particular way, what is the reasoning behind it), my aim is not to make this particular program run as fast as possible. Thank you for the input though it was very helpful – small_cat_destroyer Apr 15 '20 at 17:33
1

You did no optimizations for two threads. Below is the single-thread solution. The thread iterated through 1999999999 numbers.

for(int i = 1; i <= 1999999999; i++){
    if(i % 2 == 1)
        oddSum += i;
    else
        evenSum += i;
}

Below are two loops for two threads, each of them iterates through the same numbers. So each thread performs almost the same amount of work.

for(int i = 1; i <= 1999999999; i++){
    if(i % 2 == 1)
        oddSum += i;
}

for(int i = 1; i <= 1999999999; i++){
    if(i % 2 == 0)
        evenSum += i;
}

The optimized solution for two threads, sharing 1999999999 iterations between two threads:

for(int i = 1; i <= 1999999999; i += 2){
    oddSum += i;
}


for(int i = 2; i <= 1999999999; i += 2){
    evenSum += i;
}

Now each thread performed twice less amount of work.

273K
  • 29,503
  • 10
  • 41
  • 64
  • Thank you for the suggestion. However my question did not regard optimization (although I can see how your code is an improvement). I was simply asking for the difference in runtimes between the code when run in 1 vs 2 threads. – small_cat_destroyer Apr 15 '20 at 00:10
  • I have explained it in the answer to you. Each thread from your solution does the same amount of work as the single-thread solution. Thus there is no difference in time intervals. – 273K Apr 15 '20 at 00:13
  • If I run your improved code on 1 thread, there would be no difference from running still your code in 2 threads. My problem was in the fact that the two global variables were apparently in the same cache line, as suggested in the accepted answer. – small_cat_destroyer Apr 15 '20 at 00:15
  • Don't say, try it to be sure. Single thread will iterate through 1999999999 numbers, two-threads will iterate through 1999999999/2 numbers in parallel, 2 times faster. – 273K Apr 15 '20 at 00:17
  • "Splitting the work" the way it was done in the question does have the potential to reduce the work done a bit (because the writes can be done concurrently). But obviously we shouldn't expect half the runtime when we're still doing the same number of compare/shift operations on each thread. – Mikel Rychliski Apr 15 '20 at 00:24
  • @s.m. my intention was as follows: In single thread I have two loops, each executing 1999999999 times, so in total 1999999999 * 2 for the single thread. I was trying to "halve" the execution time by having two threads, which each do 1999999999 iterations but in parallel. While I understand the optimization your code introduces, it was not the specific way I wanted to go about it. – small_cat_destroyer Apr 15 '20 at 17:25
  • Ok. It is not clear from the question that you run two loops in a single thread instead of two conditions in one loop – 273K Apr 15 '20 at 17:50