Making a for loop faster by splitting it in threads

Question

Before I start, let me say that I've only used threads once when we were taught about them in university. Therefore, I have almost zero experience using them and I don't know if what I'm trying to do is a good idea.

I'm doing a project of my own and I'm trying to make a for loop run fast because I need the calculations in the loop for a real-time application. After "optimizing" the calculations in the loop, I've gotten closer to the desired speed. However, it still needs improvement.

Then I remembered threading. I thought I could make the loop run even faster if I split it into 4 parts, one for each core of my machine. So this is what I tried to do:

#include <thread>

void doYourThing(int size, int threadNumber, int numOfThreads) {
    int start = (threadNumber - 1) * size / numOfThreads;
    int end = threadNumber * size / numOfThreads;
    for (int i = start; i < end; i++) {
        //Calculations...
    }
}
int main(void) {
    int size = 100000;
    int numOfThreads = 4;

    int start = 0;
    int end = size / numOfThreads;
    std::thread coreB(doYourThing, size, 2, numOfThreads);
    std::thread coreC(doYourThing, size, 3, numOfThreads);
    std::thread coreD(doYourThing, size, 4, numOfThreads);

    for (int i = start; i < end; i++) {
        //Calculations...
    }
    coreB.join();
    coreC.join();
    coreD.join();
}

With this, computation time changed from 60ms to 40ms.

Questions:

1) Do my threads really run on different cores? If they do, I would have expected a bigger speedup. More specifically, I assumed it would take close to 1/4 of the initial time.

2) If they don't, should I use even more threads to split the work? Will that make my loop faster or slower?

How did you compile this, with what compiler and what flags? Is the calculation meaningfully long, enough to overcome the threading overhead? How are the results merged? — François Andrieux, Oct 31 '20 at 13:06
@FrançoisAndrieux I'm using Visual Studio, if that answers your first question. I don't know which calculations are considered long, but the loop initially took about 60ms to complete. As for the results, I just use a breakpoint before and after this whole process. — John Katsantas, Oct 31 '20 at 13:19
Do you know about Debug and Release builds and how much the compiler can optimize your code? If you've measured the time in a debug build, the measurement is almost meaningless. Switch to Release and try it again. — Lukas-T, Oct 31 '20 at 13:33
You can't reliably measure execution time inside the debugger. You need to measure the time yourself in an optimized build. — molbdnilo, Oct 31 '20 at 13:43
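
A minimal sketch of the kind of measurement the comments above suggest, assuming the loop writes its results into an output vector. The names `calculate` and `timeLoopMs` are hypothetical and not from the question. It times the loop with std::chrono::steady_clock instead of breakpoints, and is meant to be compiled as an optimized (Release) build:

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real per-element calculation.
static double calculate(std::size_t i) {
    return static_cast<double>(i) * 0.5;
}

// Runs the loop once over `out` and returns the elapsed wall-clock time
// in milliseconds, measured with a monotonic clock.
double timeLoopMs(std::vector<double>& out) {
    const auto begin = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = calculate(i);
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - begin).count();
}
```

Calling `timeLoopMs` several times on a `std::vector<double>` of the real size and looking at the median is likely to give a steadier number than a single run. Using the contents of the vector afterwards (printing one element, for instance) helps make sure the optimizer keeps the work.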

score 1 · Accepted Answer · answered Oct 31 '20 at 13:21

(1). The question @François Andrieux asked is a good one. The original code contains a well-structured for loop, so if you compile with optimization (for example -O3), the compiler might be able to vectorize the computation. That vectorization alone can give you a speedup.

It also depends on what the critical path in your computation is. According to Amdahl's law, the achievable speedup is limited by the part of the work that cannot be parallelised. You should also check whether the calculations access any variable protected by a lock; if they do, the threads may be spending part of their time spinning or waiting on that lock.
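
To make the Amdahl's law point concrete, here is a small sketch. The function names are mine, and it assumes the 60ms and 40ms figures from the question are representative and that thread start-up overhead can be ignored, which may well not hold:

```cpp
// Amdahl's law: speedup with n threads when a fraction p (0..1) of the
// single-threaded run time can be parallelised.
double amdahlSpeedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

// The same formula solved for p: the parallel fraction implied by an
// observed speedup s with n threads (n > 1).
double impliedParallelFraction(double s, int n) {
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n);
}
```

Going from 60ms to 40ms is a speedup of 1.5 with 4 threads, so `impliedParallelFraction(1.5, 4)` gives 4/9, or roughly 0.44. Under those assumptions less than half of the measured time was actually being split across threads, and even an unlimited number of threads could not get past a speedup of 1 / (1 - 4/9) = 1.8.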

(2). To find out the total number of cores and threads on your machine, you can use the lscpu command (on Linux), which shows the core and thread information for your computer or server.

(3). More threads will not necessarily give you better performance.
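
As a rough illustration of points (2) and (3), the sketch below asks the standard library for the number of hardware threads and splits the loop into that many contiguous chunks. `calculate`, `chooseThreadCount` and `parallelFill` are hypothetical names, and the code assumes each iteration writes only to its own output element, so no locks are needed:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real per-element calculation.
static double calculate(std::size_t i) {
    return static_cast<double>(i) * 2.0;
}

// One thread per hardware thread. hardware_concurrency() is only a hint
// and may return 0 when the value is unknown, so fall back to 1.
unsigned chooseThreadCount() {
    return std::max(1u, std::thread::hardware_concurrency());
}

// Fills out[i] = calculate(i), splitting the index range into
// numThreads contiguous chunks. The calling thread takes the first chunk.
void parallelFill(std::vector<double>& out, unsigned numThreads) {
    numThreads = std::max(1u, numThreads);
    const std::size_t size = out.size();

    auto work = [&out](std::size_t start, std::size_t end) {
        for (std::size_t i = start; i < end; ++i) {
            out[i] = calculate(i);
        }
    };

    std::vector<std::thread> workers;
    workers.reserve(numThreads - 1);
    for (unsigned t = 1; t < numThreads; ++t) {
        // Same chunk arithmetic as in the question, with 0-based thread index.
        workers.emplace_back(work, t * size / numThreads,
                             (t + 1) * size / numThreads);
    }
    work(0, size / numThreads);

    for (auto& w : workers) {
        w.join();
    }
}
```

A call such as `parallelFill(results, chooseThreadCount())` uses as many threads as the machine reports. Asking for more than that usually just adds scheduling and start-up overhead, and for a loop that only takes tens of milliseconds the cost of creating the threads can already be a noticeable share of the total.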

Keijo · Answer 2 · 2020-11-10T21:45:20.237

There is a header-only library on GitHub which may be just what you need. Presumably your doYourThing processes an input vector (of size 100000 in your code) and stores the results in another vector. In that case, all you need to write is

auto vectorOut = Lazy::runForAll(vectorIn, myFancyFunction);

The library will decide how many threads to use based on how many cores you have.

On the other hand, if the compiler is able to vectorize your algorithm and it still looks like it is a good idea to split the work into 4 chunks like in your example code, you could do it for example like this:

#include "Lazy.h"

void doYourThing(const MyVector& vecIn, int from, int to, MyVector& vecOut)
{
  for (int i = from; i < to; ++i) {
    // Calculate vecOut[i]
  }
}

int main(void) {
  int size = 100000;
  MyVector vecIn(size), vecOut(size);
  // Load vecIn vector with input data...
  
  Lazy::runForAll({{std::pair{0, size/4}, {size/4, size/2}, {size/2, 3*size/4}, {3*size/4, size}},
    [&](auto indexPair) { 
      doYourThing(vecIn, indexPair.first, indexPair.second, vecOut);
    });
  // Now the results are in vecOut                                    
}

README.md gives further examples on parallel execution which you might find useful.
