2

I have some expensive computation I want to divide and distribute over a set of threads. I dumbed down my code to a minimal example where this is still happening.

In short:

I have N tasks that I want to divide into "Threads" threads.

Each task is the following simple function of running a bunch of simple mathematical operations. (In practice I verify asymmetric signatures here, but I excluded that for the sake of simplification)

while (i++ < 100000)
        {
            for (int y = 0; y < 1000; y++)
            {
                sqrt(y);
            }
        }

Running the above code with 1 thread results in 0.36 seconds per operation (outermost for loop), and thus in around 36 seconds overall execution time.

Thus, parallelization seemed like an obvious way to speed it up. However, with two threads the operation time rises to 0.72 seconds completely destroying any speed up.

Adding more threads results usually in an increasingly worse performance.

I got a Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz with 6 physical cores. So I'd expect a performance boost at least using when going from 1 to 2 threads. But in fact each operation slows down when increasing the number of threads.

Am I doing something wrong?

Full code:

using namespace std;

const size_t N = 100;
const size_t Threads = 1;

atomic_int counter(0);

struct ThreadData
{
    int index;
    int count;

    ThreadData(const int index, const int count): index(index), count(count){};
};

void *executeSlave(void *threadarg)
{
    struct ThreadData *my_data;
    my_data = static_cast<ThreadData *>(threadarg);
    for( int x = my_data->index; x < my_data->index + my_data->count; x++ )
    {
        cout << "Thread: " << my_data->index <<  ": " << x << endl;

        clock_t start, end;
        start = clock();
        int i = 0;

        while (i++ < 100000)
        {
            for (int y = 0; y < 1000; y++)
            {
                sqrt(y);
            }
        }
        counter.fetch_add(1);

        end = clock();
        cout << end - start << ':' << CLOCKS_PER_SEC << ':' << (((float) end - start) / CLOCKS_PER_SEC)<< endl;
    }

    pthread_exit(NULL);
}

int main() 
{
    clock_t start, end;
    start = clock();

    pthread_t threads[Threads];
    vector<ThreadData> td;
    td.reserve(Threads);
    int each = N / Threads;
    cout << each << endl;
    for (int x = 0; x < Threads; x++) {
        cout << "main() : creating thread, " << x << endl;
        td[x] = ThreadData(x * each, each);

        int rc = pthread_create(&threads[x], NULL, executeSlave, (void *) &td[x]);

        if (rc) {
            cout << "Error:unable to create thread," << rc << endl;
            exit(-1);
        }
    }

    while (counter < N) {
        std::this_thread::sleep_for(10ms);
    }

    end = clock();

    cout << "Final:" << endl;
    cout << end - start << ':' << CLOCKS_PER_SEC << ':' << (((float) end - start) / CLOCKS_PER_SEC)
         << endl;

}
raycons
  • 735
  • 12
  • 26
  • 1
    Why not use [`std::thread`](https://en.cppreference.com/w/cpp/thread/thread) instead of `pthread`? – Fred Larson Jun 05 '20 at 12:58
  • 1
    What compiler and which C++ standard are you using? In modern C++, using `pthread_t` is a big red flag. – Eljay Jun 05 '20 at 12:58
  • 3
    Off topic but... you call `td.reserve(Threads)` but then use `td[x] = ...` without ever causing the size of the vector `td` to be set in any way. Did you mean `resize` rather than `reserve`? – G.M. Jun 05 '20 at 13:09
  • 2
    See here: [Incorrect Time in C++](https://stackoverflow.com/questions/24292507/incorrect-time-in-c). You should use `std::chrono` to measure elapsed time instead. – rustyx Jun 05 '20 at 13:11
  • You're only using one core. The reason for that is something you haven't told us. Maybe https://en.wikipedia.org/wiki/Processor_affinity – Matt Timmermans Jun 05 '20 at 13:17
  • 2
    Unrelated to your problem but you might want to consider replacing the while loop with `for (auto& t : threads) { t.join(); }` (or pthread_join if you decide to stick with pthreads). That way you can get rid of the ugly sleep. – Martin Konrad Jun 05 '20 at 13:23
  • It's because I only need K out of N signatures to be verified and some might not verify so I don't want to wait until all threads are finished before continuing, I can do that earlier and get some extra time out of it. – raycons Jun 05 '20 at 13:27
  • @rustyx but shouldn't the number of clocks spent on this while block be always the same? – raycons Jun 05 '20 at 13:28
  • 1
    Off topic, but "executeSlave" is an incredibly tasteless name. – molbdnilo Jun 05 '20 at 14:19
  • Your assignment to `td[x]` has undefined behaviour. Fix that first. – molbdnilo Jun 05 '20 at 14:22

1 Answers1

1

clock() returns approximate CPU time for the entire process.

The outermost loop does a fixed amount of work per iteration

    int i = 0;
    while (i++ < 100000)
    {
        for (int y = 0; y < 1000; y++)
        {
            sqrt(y);
        }
    }

Therefore, process CPU time reported around this loop will be proportional to the number of running threads (it still takes the same amount of time per thread, times N threads).

Use std::chrono::steady_clock to measure wall clock time instead. Note also that I/O such as std::cout takes a lot of wall clock time and is unstable. So the measured total elapsed time will be skewed due to the I/O inside.

Some additional remarks:

  1. The return value of sqrt() is never used; the compiler may eliminate the call entirely. It would be prudent to use the value in some way to be sure.

  2. void* executeSlave() isn't returning a void* pointer value (UB). It should probably be declared simply void if it returns nothing.

  3. td.reserve(Threads) reserves memory but does not allocate objects. td[x] then accesses nonexistent objects (UB). Use td.emplace_back(x * each, each) instead of td[x] = ....

  4. Not technically an issue, but it is recommended to use the standard C++ std::thread instead of pthread, for better portability.

With the following I'm seeing correct speedup proportional to the # of threads:

#include <string>
#include <iostream>
#include <vector>
#include <atomic>
#include <cmath>
#include <thread>

using namespace std;
using namespace std::chrono_literals;

const size_t N = 12;
const size_t Threads = 2;

std::atomic<int> counter(0);
std::atomic<int> xx{ 0 };

void executeSlave(int index, int count, int n)
{
    double sum = 0;
    for (int x = index; x < index + count; x++)
    {
        cout << "Thread: " << index << ": " << x << endl;
        auto start = std::chrono::steady_clock::now();
        for (int i=0; i < 100000; i++)
        {
            for (int y = 0; y < n; y++)
            {
                sum += sqrt(y);
            }
        }
        counter++;

        auto end = std::chrono::steady_clock::now();
        cout << 1e-6 * (end - start) / 1us << " s" << endl;
    }
    xx += (int)sum; // prevent optimization

}

int main()
{
    std::thread threads[Threads];
    int each = N / Threads;
    cout << each << endl;
    auto start = std::chrono::steady_clock::now();
    for (int x = 0; x < Threads; x++) {
        cout << "main() : creating thread, " << x << endl;
        threads[x] = std::thread(executeSlave, x * each, each, 100);
    }

    for (auto& t : threads) {
        t.join();
    }

    auto end = std::chrono::steady_clock::now();

    cout << "Final:" << endl;
    cout << 1e-6 * (end - start) / 1us << " s" << endl;

}
rustyx
  • 80,671
  • 25
  • 200
  • 267