Why two/four threads is even slower than single thread doing vector inner_product in C++11 multithread coing?

Question

Refering to a c++11 multi-threading example, I try to ues multi-threading to compute the vector dot_product result. The basic idea here is that we break the vector into two or four parts, and concurrent compute the partial_sum of each part. And do the summation after synchronize the two threads taks. Here, I only use CPU and RAM resources. Besides, I try create a big vector in order to cover the thread schedule ovehead.

The problem is here that two/four threads is even slower than single thread. But the CPU utilization is much higher when using two threads. Hence, I believe two physical cores are used by the program.

The platform and running time results are given below. Some people said that running time is a little bit higher. I am not sure if the efficiency is related to OS. I haven't tested the code in Linux, anyone help me do the test?

Any help would be greatly appreciated.

Here is my thread implementation:

#include<iostream>
#include<thread>
#include<vector>
#include<mutex>
#include<ctime>
#include<numeric>
#include<iterator>

using namespace std;
vector<int> boundary(int num, int parts)
{
    vector<int >bounds;
    int delta = num / parts;
    int remainder = num % parts;
    int prev = 0, next = 0;
    bounds.push_back(prev);

    for(int i = 0; i < parts; i++)
    {
        next = prev + delta;
        if(i == parts - 1)
            next = next + remainder;
        bounds.push_back(next);
        prev = next;
    } 

    return bounds;
}

void dot_product(const vector<int>& v1, const vector<int>& v2, int& result, int L, int R, int tid)
{
    int partial_sum = 0;

    for(int i = L; i < R; i++)
        partial_sum += v1[i] * v2[i];

    //lock_guard<mutex> lock(barrier);
    //cout << "tid: " << tid<< endl;
    result = partial_sum;
}

int main(int argc, char* argv[])
{
    clock_t start, end;
    int numOfElement = 500000000;
    // Change the thread number here
    int numOfThread = 2;
    int result[numOfThread] = {0};
    vector<thread> threads;

    // Fill two vectors with some values 
    vector<int> v1(numOfElement, 1), v2(numOfElement, 2);

    // Split numOfElement into nr_threads parts
    vector<int> limits = boundary(numOfElement, numOfThread);

    start = clock();
    // Launch multi_threads:
    for(int i = 0; i < numOfThread; i++)
        threads.push_back(thread(dot_product, ref(v1), ref(v2), ref(result[i]), limits[i], limits[i+1], i)); 

    // Join the threads with the main thread    
    for(auto &t:threads)
        t.join();

    int sum = accumulate(result, result+numOfThread, 0);
    end = clock();  
    //cout << limits[0] <<" " << limits[1]<<" "<<limits[2]<<endl;
    cout << "results: " << sum << " time elapsed: "<< double(end - start) / CLOCKS_PER_SEC << endl;
    return 0;
}

Platform:

OS: Win8 64bit

CPU: I3-3220 (2C4T)

RAM: 12G

IDE: Dev-C++ 5.11, TDM-GCC 4.9.2 release

Results so far:

1 Thread: 14.42 seconds, CPU: 60%, RAM: 3.82G(program only, revised it)

2 Threads: 19.65 seconds, CPU: 82%, RAM: 3.82G(program only, revised it)

4 Threads: 22.33 seconds, CPU: 99%, RAM: 3.82G(program only, revised it)

Update: Referring to this link, I set the -O2 optimation for GCC at the expense of compilation time reported by this link. Now the result seems normal as expectation, which is comparable to what @yzt reported below. Further, I replace the clock with chrono for high_resolution timming.

New results:

1 Thread: 0.57 seconds

2 Threads: 0.31 seconds

4 Threads: 0.28 seconds

Seem that 2 thread almost approach the optimal speed in my CPU, I guess the reason behind this is that I3 only get 2 physical cores and 4 logical cores.

Consider using a Task-Parallel library instead of using raw threads directly, because these libraries intelligently scale work according to the capabilities of the machine. If you're using Visual C++ on Windows consider using PPL ( https://msdn.microsoft.com/en-us/library/dd492418.aspx ), then please report back your findings. — Dai, Sep 03 '15 at 05:44
Debug or release? On Windows there's a major difference in vector implementation, and the debug one may be locking mutexes / checking guards to see that there's no concurrent problematic access slowing down your test. Based on raw performance numbers I'd expect your full test to finish in about 1-2 seconds, so the 15 for one core is already very high. — dascandy, Sep 03 '15 at 05:46
There could also be some cache coherence issues. The threads are operating on some of the same data blocks. That may mean they are not able to use caching as aggressively. — juanchopanza, Sep 03 '15 at 05:49
The data size is far larger than any of the caches, so in effect it's all down to the speed of the SDRAM in your machine. With two or more threads accessing different memory areas all the time the memory controllers simply cannot run the SDRAM efficiently, so it slows down. Memory is not the magical limitless resource that many assume it to be; you have to consider the memory pressure resulting from an algorithm, cache sizes, etc. — bazza, Sep 03 '15 at 05:56
@dascandy Thanks in advance. I use release actually. Can you help test my code in Linux platform? — user3409554, Sep 03 '15 at 06:02
Btw, try to use vector reserve so it won't reallocate space. — , Sep 03 '15 at 06:11
@Satus: AFAICT, there is no vector resizing in the hot parts of the code. — yzt, Sep 03 '15 at 06:13
The code is doing 500M multiplies and adds, plus reading 2x4x500M=4GB of memory. Even the single-threaded version seems too slow, let alone the multi-threaded ones. — yzt, Sep 03 '15 at 06:17
Note that CPU utilization doesn't tell you anything about the **physical** number of cores being used. — user541686, Sep 03 '15 at 06:19
@Mehrdad Thanks!! I open the task moniter to check the 4 Core usage. When I use four threads, the 4 Core is highly used. CPU is I3-3220. Did you mean that I haven't acutally used the physical core? So how can I enable that? — user3409554, Sep 03 '15 at 06:20
Oh, I said that because you said you're only running 2 threads. 4 threads would obviously use all the cores, so if you're seeing 100% on all of them then it's obviously using all the physical cores. — user541686, Sep 03 '15 at 06:22
@yzt Thanks! Can you rougly give the running time for reference? — user3409554, Sep 03 '15 at 06:29
Not reproducible here with TDM-GCC 5.1.0. Note that the `clock` function according to the standard should return **consumed CPU time**, but on Windows it returns **wall time**. If you are using C++11, you are better off using `std::chrono`. — n. m. could be an AI, Sep 03 '15 at 06:33
@n.m.: Isn't the standard definition of `clock` ill-defined for a multithreaded program though? The Windows one seems well-defined. — user541686, Sep 03 '15 at 06:45
@yzt Visual C++ uses a checked STL implementation which validates iterators before every access. Last time I checked, this was always enabled, even in release mode. Maybe this could explain the low performance of the single-threaded version. Try disabling it with `#define _SCL_SECURE_NO_WARNINGS` and `#define _HAS_ITERATOR_DEBUGGING 0`. — Jens, Sep 03 '15 at 06:50
@Jens Not on the latest VS versions. Starting from VS2010 it is disabled for released mode. — Jagannath, Sep 03 '15 at 06:51
I can't reproduce your problem. The times on my system are: 0.535, 0.383 and 0.287 seconds for 1, 2 and 4 threads. I'm on an old i7-920, Win7 64-bit and VS2015. — yzt, Sep 03 '15 at 06:51
@yzt same here, everything works just fine here as well, both in debug and release mode (debug being slower ofc) — , Sep 03 '15 at 06:53
Any anti-virus program is scanning it while executing this program ? — Jagannath, Sep 03 '15 at 06:53
@Jens: That hasn't been true in release mode since VS2010. Also, `_SCL_SECURE_NO_WARNINGS` has no bearing on this. *And*, the OP is using GCC. — yzt, Sep 03 '15 at 06:54
@yzt Ok, my VC++ skills are little bit rusty. But thanks for the info. — Jens, Sep 03 '15 at 06:58
@user3409554: Your code seems fine. The operations you are doing are too light (only a multiply and add) but you are doing everything correctly. There are no false sharing, no cache thrashing, no locking, no pointer aliasing that might result in false sharing or thrashing, no nothing! Your compiler or runtime library might be doing something fishy though, because I'm seeing numbers that are at least 30x better than yours on VC. Can you check using another compiler? — yzt, Sep 03 '15 at 07:00
@Mehrdad Windows function may be well-defined, but its definition flatly contradicts the standard. POSIX has no problem with it, it reports total CPU time of all used CPUs (thus the result may be 40 seconds for a program that runs for 10 seconds according to the wall clock and occupies 4 CPUs) — n. m. could be an AI, Sep 03 '15 at 07:16
@n.m.: Does it really? The notion of time is already ill-defined in C. There is no guarantee that calling `clock()` in the beginning of the program should result in any particular value (such as zero), and furthermore there is no guarantee that spawning a second thread should make `clock` increase twice as fast. If you're looking at it strictly, it's not a contradiction, because it's not violating any guarantees. If you're looking at it loosely, I would argue the Windows definition is more useful anyway. — user541686, Sep 03 '15 at 07:26
@user3409554: Forget the `clock()` function! Look at your watch... Does your program take 22 seconds to run with 4 threads? If in doubt, use this code to read current time: `chrono::high_resolution_clock::now()` and this code to calculate the difference between two time points: `chrono::duration_cast>(t1 - t0).count()`. They need the standard header ``. — yzt, Sep 03 '15 at 07:29
@Mehrdad The term "CPU time" is unambigous and well-understood. All POSIX systems in existence do what I have said. When you rent CPU time on a time-shared system, you get billed according to this notion of CPU time. — n. m. could be an AI, Sep 03 '15 at 07:29
@n.m. You're missing the point. The standard only describes the behavior on an abstract machine, not a physical machine. With regards to that, time is ill-defined in C. Specifically, a machine that decides to slow down or speed up unpredictably for each thread can still be in perfect compliance with the standard. So as far as the standard is concerned, the behavior your program sees in Windows is in perfect compliance. It's unfortunate that it's unintuitive, but that doesn't make it contradictory. — user541686, Sep 03 '15 at 07:36
It has already been touched upon, but I suggest finding out the cache line length of your CPU, then pre-read two chunks of the vectors into two cache lines and do the dot product on those chunks. Then read in two new chunks, and so on — Erik Alapää, Sep 03 '15 at 07:47
@Mehrdad The C standard says: "The clock function determines the processor time used." and "The clock function returns the implementation’s best approximation to the processor time used by the program since the beginning of an implementation-defined era related only to the program invocation". Windows has a perfectly good approximation of the processor time, obtained via `GetProcessTimes`, that is consistent with the POSIX interpretation. One can see this value in the "CPU time" column in the task manager. Why the current implementation of `clock` is considered better than that is beyond me. — n. m. could be an AI, Sep 03 '15 at 08:06
@n.m.: This is getting very language-lawyery, but doesn't "the implementation" refer to the C standard library implementation rather than the OS implementation? As a language lawyer (which is the case here, since we're talking about compliance) we can argue that its best approximation is indeed what it is returning, irrespective of whether the OS has a more accurate function available. As for usefulness, I consider `clock` more useful because, well, I've had more use cases for it (such as benchmarking, e.g. for this very question!) whereas I've rarely needed the other definition. — user541686, Sep 03 '15 at 08:16
@n.m.: Again, to put it another way: no valid C program can use standard library features to correctly distinguish between a "correct" and an "incorrect" implementation, so as far as the abstract machine is concerned, both behaviors are correct, because the correctness of the specification of the abstract machine only depends on observable (I/O) behavior, not timing. — user541686, Sep 03 '15 at 08:19
@Mehrdad you have all the information and can decide for yourself. I'm not going to convince you. — n. m. could be an AI, Sep 03 '15 at 08:41
@yzt Thanks so much! I will check my complier. I think GCC in DevC++ is different from VC, right? — user3409554, Sep 03 '15 at 10:19
@yzt [link](http://stackoverflow.com/questions/17348228/code-runs-6-times-slower-with-2-threads-than-with-1) I try the method in this link, adding the -O2 optimation in GCC complier. The time now is comparable to yours. — user3409554, Sep 03 '15 at 10:30
@user3409554: In one of the first comments, you said that you are using "release", when someone asked whether you are in "debug" or "release" mode. You made a mistake there. If you are not optimizing the code (and/or linking with debug version of the standard library) then you are in "debug" mode, and not in "release" mode. However, that lingo technically applies only to Visual Studio, but I guess it is expected that users of other compilers at least know what people are referring to when saying "building in debug or release mode". — yzt, Sep 03 '15 at 19:49
@user3409554: Just know that all of this basically useless discussion wouldn't have happened if you hadn't said this: "I use release actually." — yzt, Sep 03 '15 at 19:51
@yzt Sorry for my confusion. Because the complier in DEV-C++ I choose is GCC 4.9.2 64 bit Release. I am not sure that is still in debug mode. Further, I said I am using GCC not VS — user3409554, Sep 04 '15 at 10:01

Why two/four threads is even slower than single thread doing vector inner_product in C++11 multithread coing?

0 Answers0