
I don't know much about multi-threading, and I have no idea why this is happening, so I'll get straight to the point.

I'm processing an image: I divide it into 4 parts and pass one part to each thread (essentially, I pass the indices of the first and last pixel rows of each part). For example, if the image has 1000 rows, each thread processes 250 of them. I can go into detail about my implementation and what I'm trying to achieve in case that helps. For now, here is the code executed by the threads, in case you can spot why this is happening. I don't know if it's relevant, but in both cases (1 thread or 4 threads) the process takes around 15 ms, and `pfUMap` and `pbUMap` are unordered maps.

void jacobiansThread(int start, int end,vector<float> &sJT,vector<float> &sJTJ) {

    uchar* rgbPointer;
    float* depthPointer;
    float* sdfPointer;
    float* dfdxPointer; float* dfdyPointer;
    float fov = radians(45.0);
    float aspect = 4.0 / 3.0;
    float focal = 1 / (glm::tan(fov / 2));
    float fu = focal * cols / 2 / aspect;
    float fv = focal * rows / 2;

    float strictFu = focal / aspect;
    float strictFv = focal;

    vector<float> pixelJacobi(6, 0);

    for (int y = start; y < end; y++) {
        rgbPointer = sceneImage.ptr<uchar>(y);
        depthPointer = depthBuffer.ptr<float>(y);
        dfdxPointer = dfdx.ptr<float>(y);
        dfdyPointer = dfdy.ptr<float>(y);
        sdfPointer = sdf.ptr<float>(y);
        for (int x = roiX.x; x < roiX.y; x++) {
            float deltaTerm;// = deltaPointer[x];
            float raw = sdfPointer[x];
            if (raw > 8.0) continue;
            float dirac = (1.0f / float(CV_PI)) * (1.2f / (raw * 1.44f * raw + 1.0f));
            deltaTerm = dirac;
            vec3 rgb(rgbPointer[x * 3], rgbPointer[x * 3 + 1], rgbPointer[x * 3 + 2]);
            vec3 bin = rgbToBin(rgb, numberOfBins);
            int indexOfColor = bin.x * numberOfBins * numberOfBins + bin.y * numberOfBins + bin.z;
            float s3 = glfwGetTime();
            float pF = pfUMap[indexOfColor];
            float pB = pbUMap[indexOfColor];
            float heavisideTerm;
            heavisideTerm = HEAVISIDE(raw);
            float denominator = (heavisideTerm * pF + (1 - heavisideTerm) * pB) + 0.000001;
            float commonFirstTerm = -(pF - pB) / denominator * deltaTerm;
            if (pF == pB) continue;
            vec3 pixel(x, y, depthPointer[x]);

            float dfdxTerm = dfdxPointer[x];
            float dfdyTerm = -dfdyPointer[x];

            if (pixel.z == 1) {
                cv::Point c = findClosestContourPoint(cv::Point(x, y), dfdxTerm, -dfdyTerm, abs(raw));
                if (c.x == -1) continue;
                pixel = vec3(c.x, c.y, depthBuffer.at<float>(cv::Point(c.x, c.y)));
            }

            vec3 point3D = pixel;
            pixelToViewFast(point3D, cols, rows, strictFu, strictFv);


            float Xc = point3D.x; float Xc2 = Xc * Xc;
            float Yc = point3D.y; float Yc2 = Yc * Yc;
            float Zc = point3D.z; float Zc2 = Zc * Zc;
            pixelJacobi[0] = dfdyTerm * ((fv * Yc2) / Zc2 + fv) + (dfdxTerm * fu * Xc * Yc) / Zc2;
            pixelJacobi[1] = -dfdxTerm * ((fu * Xc2) / Zc2 + fu) - (dfdyTerm * fv * Xc * Yc) / Zc2;
            pixelJacobi[2] = -(dfdyTerm * fv * Xc) / Zc + (dfdxTerm * fu * Yc) / Zc;
            pixelJacobi[3] = -(dfdxTerm * fu) / Zc;
            pixelJacobi[4] = -(dfdyTerm * fv) / Zc;
            pixelJacobi[5] = (dfdyTerm * fv * Yc) / Zc2 + (dfdxTerm * fu * Xc) / Zc2;

            float weightingTerm = -1.0 / log(denominator);
            for (int i = 0; i < 6; i++) {
                pixelJacobi[i] *= commonFirstTerm;
                sJT[i] += pixelJacobi[i];
            }
            for (int i = 0; i < 6; i++) {
                for (int j = i; j < 6; j++) {
                    sJTJ[i * 6 + j] += weightingTerm * pixelJacobi[i] * pixelJacobi[j];
                }
            }

        }
    }
}

This is the part where I launch the threads:

vector<std::thread> myThreads;
    float step = (roiY.y - roiY.x) / numberOfThreads;
    vector<vector<float>> tsJT(numberOfThreads, vector<float>(6, 0));
    vector<vector<float>> tsJTJ(numberOfThreads, vector<float>(36, 0));
    for (int i = 0; i < numberOfThreads; i++) {
        int start = roiY.x+i * step;
        int end = start + step;
        if (end > roiY.y) end = roiY.y;
        myThreads.push_back(std::thread(&pwp3dV2::jacobiansThread, this,start,end,std::ref(tsJT[i]), std::ref(tsJTJ[i])));
    }

    vector<float> sJT(6, 0);
    vector<float> sJTJ(36, 0);
    for (int i = 0; i < numberOfThreads; i++)myThreads[i].join();
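(The summation of the per-thread buffers into `sJT` and `sJTJ` after the join is omitted above; it is essentially a plain reduction, sketched here for completeness — the name `reducePartials` is just for illustration:)

```cpp
#include <cstddef>
#include <vector>

// Sum each thread's partial results into the final buffers.
// Each tsJT[t] has 6 entries, each tsJTJ[t] has 36.
void reducePartials(const std::vector<std::vector<float>>& tsJT,
                    const std::vector<std::vector<float>>& tsJTJ,
                    std::vector<float>& sJT, std::vector<float>& sJTJ) {
    for (std::size_t t = 0; t < tsJT.size(); ++t) {
        for (int i = 0; i < 6; i++)  sJT[i]  += tsJT[t][i];
        for (int i = 0; i < 36; i++) sJTJ[i] += tsJTJ[t][i];
    }
}
```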

Other Notes

To measure time, I called glfwGetTime() right before and right after the second code snippet. The measurements vary, but the average is about 15 ms for both implementations, as I mentioned.

John Katsantas
  • @463035818_is_not_a_number I'm asking why using 4 threads that process 250 rows in parallel is as fast as using a single thread that processes 1000 rows. – John Katsantas Aug 07 '21 at 17:32
  • how did you measure the time? – 463035818_is_not_an_ai Aug 07 '21 at 17:32
  • sorry I stupidly misread the title ;) – 463035818_is_not_an_ai Aug 07 '21 at 17:33
  • @463035818_is_not_a_number Visual Studio release mode using glfwGetTime() and printing the resulting runtime. – John Katsantas Aug 07 '21 at 17:33
  • Starting a thread has significant overhead, which might skew your measurements. Try keeping the threads alive between calls. – Thomas Aug 07 '21 at 17:33
  • @Thomas What do you mean by keeping the threads alive? This might be irrelevant because I don't understand but after calling the threads I call join for each of them and wait until they are all done before I go on. – John Katsantas Aug 07 '21 at 17:35
  • 15ms is not much, and creating and joining threads adds overhead – 463035818_is_not_an_ai Aug 07 '21 at 17:36
  • @Thomas And starting the threads ( meaning checking how long it took to finish the loop which starts the threads ) takes less than 1ms. – John Katsantas Aug 07 '21 at 17:36
  • @463035818_is_not_a_number I'm following an implementation explained on a paper and in their implementation multi-threading improved runtime significantly. So, I assume there is something wrong with what I do. I'm trying to achieve real-time tracking and this is the last thing left to optimize. If this 15ms goes to 5ms I'm done. – John Katsantas Aug 07 '21 at 17:36
  • I don't understand this line `float step = (roiY.y - roiY.x) / numberOfThreads;`: `step` should probably be an integer, but what are `roiY.y` and `roiY.x`? – 463035818_is_not_an_ai Aug 07 '21 at 17:38
  • The fact that the `std::thread` constructor returns doesn't mean that the thread has begun processing data. This is up to the scheduler, among other things. The common solution is to keep threads running in the background and send them data when you need them, instead of calling the `std::thread` constructor to create a new thread every time you have some work to do. – Thomas Aug 07 '21 at 17:39
  • you can try to see what happens when you increase the image size; if this does not change the speedup of single-threaded vs multiple threads then you know something is wrong, and if it does, it hints at overhead. Also, when you measure thread overhead you can try to measure from creation till `join` with a known workload. – 463035818_is_not_an_ai Aug 07 '21 at 17:42
  • So, you are saying I should have a global `vector myThreads` and instead of creating them for every image, I should just give new data and tell them to start again? Will this improve my runtime? – John Katsantas Aug 07 '21 at 17:42
  • yes, that's what Thomas is saying – 463035818_is_not_an_ai Aug 07 '21 at 17:42
  • I'll give it a go and be back then. Will also check what's happening when I increase the image size. I've read here and there that memory cache is a thing when you experience behavior like this. I have no idea what it means though. Could it be connected to the fact that I access elements from cv::Mat images? I access different rows and write to different parts of the memory so it shouldn't be a problem, right? Except for the unordered maps. All threads access the same maps. – John Katsantas Aug 07 '21 at 17:44
  • FYI: my own fiddling with multi threading [Multi-threading benchmarking issues](https://stackoverflow.com/a/52835213/7478597) – Scheff's Cat Aug 07 '21 at 18:28
  • @Thomas No luck when using the same threads. It reduced the time by less than 2ms. Should be much more, right? – John Katsantas Aug 07 '21 at 19:32
  • @463035818_is_not_a_number Using higher resolution images indicated an actual 10%(at best) improvement. It should be much higher, shouldn't it? – John Katsantas Aug 07 '21 at 19:33
  • @463035818_is_not_a_number Oh, I didn't see your comment about `step`. You are right, it should be an int but it doesn't affect the loops. At worst, 1 or 2 lines of pixels will be checked by more than one thread. `roiY` is a range of rows that contain important pixels. It consists of 2 integers. – John Katsantas Aug 07 '21 at 19:38
  • @Thomas Turns out I made some mistakes when using global threads. Your solution reduced the time to about one third! You can add it as an answer if you want. – John Katsantas Aug 07 '21 at 21:35
  • you say 15 milliseconds... that leads me to believe your time measurement function (you kept that a secret!) might have the granularity of scheduler time slices. you're getting exactly one time slice there. find a function that gives you better granularity. test that it does. – Christoph Rackwitz Aug 07 '21 at 22:20
  • @ChristophRackwitz Many unknown words here for me. I added the time function in my post, sorry about that. Do you ask if my time function is essentially discrete? That , for example, it can only print 15ms, 30ms,45ms etc ? I'm sure that's not the case here. – John Katsantas Aug 07 '21 at 22:41
  • In Windows, the typical "time measurement accuracy" is 10-20 ms from my experience. I think Christoph Rackwitz knows the terms, details, and correct value for that (time slice?). Can you try much bigger images that, for example, need 1 second to compute in the single-threaded scenario? – Micka Aug 08 '21 at 06:54
  • "essentially discrete" yes. it might have to do with scheduler time slices, or some other reason. in any case, it's the interval of some timer interrupt. 15.625 ms is 1/64 second. some Windowses run with different timer interrupt intervals (nice round 10 or 1 ms). if you *aren't* running on windows, but get a visible granularity of ~15 ms (31, 46, 62...), that would surprise me. test this with a dumb for-loop that does nothing a lot, and see what time differences you can reach. – Christoph Rackwitz Aug 08 '21 at 10:29

3 Answers

4

Starting a thread has significant overhead, which might not be worth the time if you have only 15 milliseconds worth of work.

The common solution is to keep threads running in the background and send them data when you need them, instead of calling the std::thread constructor to create a new thread every time you have some work to do.
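A minimal sketch of that idea: a pool whose threads are created once and woken through a condition variable. The `WorkerPool` name and interface are illustrative, not a standard API:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Persistent worker pool: threads are created once and reused, so each
// batch of work only pays for signaling, not for thread creation.
class WorkerPool {
public:
    explicit WorkerPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] { run(); });
    }
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(m);
            stopping = true;
        }
        cv.notify_all();
        for (auto& t : workers) t.join();
    }
    // Enqueue one job; returns immediately.
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(m);
            jobs.push(std::move(job));
        }
        cv.notify_one();
    }
    // Block until every submitted job has finished.
    void wait() {
        std::unique_lock<std::mutex> lock(m);
        idle.wait(lock, [this] { return jobs.empty() && active == 0; });
    }
private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return stopping || !jobs.empty(); });
                if (stopping && jobs.empty()) return;
                job = std::move(jobs.front());
                jobs.pop();
                ++active;   // counted under the same lock as the pop
            }
            job();
            {
                std::lock_guard<std::mutex> lock(m);
                --active;
            }
            idle.notify_all();
        }
    }
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> jobs;
    std::mutex m;
    std::condition_variable cv, idle;
    unsigned active = 0;
    bool stopping = false;
};
```

Per frame you would then `submit` one lambda per image strip and call `wait()`, instead of constructing and joining fresh `std::thread`s every time.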

Thomas
  • 15 ms is forever. Creating a thread takes much less than 1ms and, if signaling threads that are waiting, (as suggested), the latency between source thread signal and target thread run is typically in the order of 10us if there is a core available. – Martin James Aug 08 '21 at 09:32
  • I was surprised that it seemed to take so long too, but OP said it solved their problem and asked me to post this as an answer. If spawning 4 threads on a 4-core system, there won't be a core available for at least one of those threads. Maybe that explains it. – Thomas Aug 08 '21 at 09:35
1

Pure speculation, but two things might be preventing you from getting the full benefit of parallelization:

  1. Processing speed is limited by the memory bus: cores wait until data is loaded before continuing.
  2. Data sharing between cores: some caches are core-specific, so if memory is shared between cores, data must travel down to a shared cache before it can be loaded.

On Linux you can use Perf to check for cache misses.
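One concrete instance of point 2 is false sharing: if per-thread accumulators sit on the same cache line, every write by one core invalidates that line for the others. A sketch of the usual fix, padding each thread's slot to a cache line (the 64-byte size is an assumption; `std::hardware_destructive_interference_size` reports it where available):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Per-thread accumulators padded to an assumed 64-byte cache line so
// writes from different cores do not invalidate each other's lines.
struct alignas(64) PaddedSum {
    double value = 0.0;
};

double parallelSum(const std::vector<float>& data, unsigned nThreads) {
    std::vector<PaddedSum> partial(nThreads);
    std::vector<std::thread> threads;
    std::size_t chunk = data.size() / nThreads;
    for (unsigned t = 0; t < nThreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == nThreads) ? data.size() : begin + chunk;
        threads.emplace_back([&, t, begin, end] {
            double s = 0.0;                 // accumulate in a local first,
            for (std::size_t i = begin; i < end; ++i)
                s += data[i];
            partial[t].value = s;           // write shared memory only once
        });
    }
    double total = 0.0;
    for (unsigned t = 0; t < nThreads; ++t) {
        threads[t].join();
        total += partial[t].value;
    }
    return total;
}
```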

doron
  • I have no idea what these all mean. But all similar questions seem to revolve around what you mentioned. Could you refer me to some written examples so I can study these things? Is there a way to tell if this is the case just by looking at my code? – John Katsantas Aug 07 '21 at 17:46
  • The most typical problem regarding memory is when you thrash your cache, e.g. in each loop iteration you jump over a portion of memory. For example, if your image is stored row-by-row but you iterate over columns first in your innermost loop. However, as you are using `cv::Mat` and access with `at(cv::Point(x,y))` where `x` is the inner-most loop variable, this should be fine. Nevertheless, if you want to explore this avenue, read the description of the `cv::Mat` class in the docs, where they discuss the memory layout, and then check how you access the object from your code. – CygnusX1 Aug 07 '21 at 20:43
0

If you want better timings, you need to decouple the loop iterations from the loop counter, which takes some preprocessing: something fast like an array of structs with a header for each segment, or, if you can't think of anything better, just a `vector<int>` holding the counter values. Then run `for_each(std::execution::par, ...)` over that; it's much faster.

For timings there's:

auto t1 = std::chrono::system_clock::now();
// ... the code being measured ...
auto t2 = std::chrono::system_clock::now();
std::chrono::milliseconds f = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1);
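A sketch of that suggestion, assuming C++17 parallel algorithms are available (with GCC/Clang they additionally need linking against TBB, e.g. `-ltbb`); the per-row work here is a stand-in:

```cpp
#include <algorithm>
#include <atomic>
#include <execution>
#include <numeric>
#include <vector>

// Build one index per row, then let the standard library's worker pool
// split the rows across cores instead of hand-rolled std::threads.
long processRows(int rowCount) {
    std::vector<int> rows(rowCount);
    std::iota(rows.begin(), rows.end(), 0);   // 0, 1, ..., rowCount-1

    std::atomic<long> checksum{0};            // stands in for real per-row work
    std::for_each(std::execution::par, rows.begin(), rows.end(),
                  [&](int y) { checksum += y; });
    return checksum.load();
}
```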