21

I've split a complex array processing task into a number of threads to take advantage of multi-core processing, and am seeing great benefits. Currently, at the start of the task I create the threads, then wait for them to terminate as they complete their work. I typically create about four times as many threads as there are cores, as each thread is liable to take a different amount of time, and having extra threads keeps all cores occupied most of the time.

I was wondering whether there would be much of a performance advantage to creating the threads as the program fires up, keeping them idle until required, and using them as I start processing. Put more simply: how long does it take to start and end a new thread, above and beyond the processing within the thread? I'm currently starting the threads using

CWinThread *pMyThread = AfxBeginThread(CMyThreadFunc,&MyData,THREAD_PRIORITY_NORMAL);

Typically I will be using 32 threads across 8 cores on a 64-bit architecture. The process in question currently takes < 1 second, and is fired up each time the display is refreshed. If starting and ending a thread takes < 1 ms, the return doesn't justify the effort. I'm having some difficulty profiling this.

A related question here helps but is a bit vague for what I'm after. Any feedback appreciated.

SmacL
  • Display refresh sounds like a frequent event; there would definitely be a benefit to pooling threads and keeping them idle for reuse. Thread creation overhead is perhaps not too heavy, but there is still the synchronization expense, the virtual memory footprint, etc. – Roman R. Aug 16 '13 at 13:20
  • Not an answer, but running 32 threads across 8 cores won't be a good solution if your task is CPU bound. You might be better off using a number closer to the number of hardware threads you can actually use. I would suggest creating a thread pool upfront; no matter how long it takes to create them, you might want to reuse them :) – David Rodríguez - dribeas Aug 16 '13 at 13:21
  • @DavidRodríguez-dribeas, I've experimented with differing numbers of threads, and found that more threads than the maximum concurrency works best in this context. The reason is that the total elapsed time is set by when the last thread finishes, and each thread has a different workload which is difficult to estimate in advance. With only one thread per core, a single thread with significantly more work than the others slows down the whole process, as the cores that have finished their work lie idle. – SmacL Aug 16 '13 at 13:28
  • @ShaneMacLaughlin: +1 on *measuring* and understanding the behavior. – David Rodríguez - dribeas Aug 16 '13 at 13:30

3 Answers

19

I wrote this quite a while ago when I had the same basic question (along with another that will be obvious). I've updated it to show a little more about not only how long it takes to create threads, but how long it takes for the threads to start executing:

#include <windows.h>
#include <iostream>
#include <time.h>
#include <vector>

const int num_threads = 32;

const int switches_per_thread = 100000;

DWORD __stdcall ThreadProc(void *start) {
    QueryPerformanceCounter((LARGE_INTEGER *) start);
    for (int i=0;i<switches_per_thread; i++)
        Sleep(0);
    return 0;
}

int main(void) {
    HANDLE threads[num_threads];
    DWORD junk;

    std::vector<LARGE_INTEGER> start_times(num_threads);

    LARGE_INTEGER l;
    QueryPerformanceCounter(&l);

    clock_t create_start = clock();
    for (int i=0;i<num_threads; i++)
        threads[i] = CreateThread(NULL, 
                            0, 
                            ThreadProc, 
                            (void *)&start_times[i], 
                            0, 
                            &junk);
    clock_t create_end = clock();

    clock_t wait_start = clock();
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    clock_t wait_end = clock();

    double create_millis = 1000.0 * (create_end - create_start) / CLOCKS_PER_SEC / num_threads;
    std::cout << "Milliseconds to create thread: " << create_millis << "\n";
    double wait_clocks = (wait_end - wait_start);
    double switches = switches_per_thread*num_threads;
    double us_per_switch = wait_clocks/CLOCKS_PER_SEC*1000000/switches;
    std::cout << "Microseconds per thread switch: " << us_per_switch << "\n";

    LARGE_INTEGER f;
    QueryPerformanceFrequency(&f);

    for (auto s : start_times) 
        std::cout << 1000.0 * (s.QuadPart - l.QuadPart) / f.QuadPart <<" ms\n";

    return 0;
}

Sample results:

Milliseconds to create thread: 0.015625
Microseconds per thread switch: 0.0479687

The first few thread start times look like this:

0.0632517 ms
0.117348 ms
0.143703 ms
0.18282 ms
0.209174 ms
0.232478 ms
0.263826 ms
0.315149 ms
0.324026 ms
0.331516 ms
0.3956 ms
0.408639 ms
0.4214 ms

Note that although these happen to be monotonically increasing, that's not guaranteed (though there is definitely a trend in that general direction).

When I first wrote this, the units I used made more sense -- on a 33 MHz 486, those results weren't tiny fractions like this. :-) I suppose someday when I'm feeling ambitious, I should rewrite this to use std::async to create the threads and std::chrono to do the timing, but...

Jerry Coffin
  • Thanks Jerry, exactly the info I was after! – SmacL Aug 16 '13 at 14:45
  • Hmm.. interesting. I already knew that making a blocked thread ready via kernel signaling took ~50 ns, but I have never timed thread creation ('cos I hardly ever do it except at app startup). – Martin James Aug 16 '13 at 15:56
  • Thank you for the code, though I'm shocked at the results of the test. I was creating 256 threads for a project, and the time it took to create them caused minor issues. So I ran your program here to test the speed of thread creation... I got 1.0625 for thread creation and 0.997344 for thread switching. I would have imagined the speed of switching and creation would have been much better than that. – Bored915 Oct 12 '13 at 06:59
  • Jerry Coffin, your code is awfully wrong. You don't measure the time correctly. Creating threads is an async action. When you measure 'create_end', it measures the time taken to request 64 threads, not the actual creation of the threads. The same goes for the switch-time measurement. Your code is totally wrong and it only confuses others. Please fix it or remove it. – DanielHsH Feb 25 '15 at 08:08
  • Jerry. When you call 'CreateThread', this is an async method. Your loop that creates 64 threads (64 calls to CreateThread) can terminate even before the first thread was created. I will give you an example. How much time does it take to get a green card in the US? Probably a few months or years. But it takes only a few minutes to send a letter to US immigration requesting a green card. After you send the letter, the legal debate over giving you a green card is async. Your code measures the time it takes to send the letter. – DanielHsH Feb 25 '15 at 09:45
  • Correct solution: the first line in ThreadProc() should be clock(). Each thread writes the time stamp when it actually started to an array of 64 cells. Then the main thread reads this array and subtracts its 'create_start' taken prior to calling the thread. You see, a valid measurement is the interval between two time stamps: prior to the call to CreateThread(), and the first line in ThreadProc(). Thus you know exactly how long the thread took to start. – DanielHsH Feb 25 '15 at 09:48
  • P.s. - your argument about paused threads is irrelevant. Your code never measures what it intends to. Also, the measurement of switches is incorrect, because you actually include in the calculation the time it takes to kill all 64 threads. – DanielHsH Feb 25 '15 at 09:54
  • Jerry, please stop laughing at me. I am very serious. I have worked a lot with threads. The fact that your Windows CreateThread() or Linux equivalent pthread_create() function returned does not mean that the thread was created. All it means is that you requested the operating system to create a new thread. The OS will decide when in the future to actually start your thread and schedule its first run. When you measure the time of CreateThread(), you actually measure the time of your request, not the time it took to create the thread. – DanielHsH Feb 25 '15 at 14:38
  • @DanielHsH: I'm not convinced it's really a huge improvement, but it's quite a bit shorter than this comment thread, so I've added code that adds measurement of the time between starting to create threads, and the time that each thread actually begins running. – Jerry Coffin Feb 27 '15 at 21:46
  • Jerry, thanks a lot. I upvoted your answer :-) Those numbers seem legitimate. Please note however that your code will not measure the switches correctly in 'release' mode on some compilers, due to optimizations. Some compilers are smart enough to remove the Sleep(0) and the loop because it does not achieve anything. In my tests I used a busy loop (~10 million XOR calculations on 64 bits) for that reason. – DanielHsH Feb 28 '15 at 11:37
  • @DanielHsH, do you have any reference to a compiler that removes Sleep(0)? It would seem like an unusual optimization to me. Given that Sleep(0) relinquishes the remaining time slice for a thread, removing it clearly affects the performance of a multi-threaded program. – SmacL Mar 07 '16 at 08:33
4

Some advice:

  1. If you have lots of work items to process (or there aren't too many, but you have to repeat the whole process from time to time), make sure you use some kind of thread pooling. This way you won't have to recreate the threads all the time, and your original question won't matter any more: the threads are created only once. I use the QueueUserWorkItem API directly (since my application doesn't use MFC), and even that is not too painful. But in MFC you may have higher-level facilities to take advantage of thread pooling. (http://support.microsoft.com/kb/197728)
  2. Try to select the optimal amount of work for one work item. Of course this depends on the nature of your software: is it supposed to be real-time, or is it number crunching in the background? If it's not real-time, too small an amount of work per work item can hurt performance, by increasing the proportional overhead of distributing the work across threads.
  3. Since hardware configurations can be very different, and your end users can have various machines, you can run some calibration routines asynchronously during startup to estimate how much time certain operations take. The result of the calibration can then inform a better work-size setting for the real calculations.
Csaba Toth
2

I was curious about the modern Windows scheduler, so I wrote another test app. I made my best attempt at measuring thread stop time by optionally spinning up a watching thread.

// Tested on Windows 10 v1903 with E5-1660 v3 @ 3.00GHz, 8 Core(s), 16 Logical Processor(s)
// Times are (min, average, max) in milliseconds.

threads: 100, iterations: 1, testStop: true
Start(0.1083, 5.3665, 13.7103) - Stop(0.0341, 1.5122, 11.0660)

threads: 32, iterations: 3, testStop: true
Start(0.1349, 1.6423, 3.5561) - Stop(0.0396, 0.2877, 3.5195)
Start(0.1093, 1.4992, 3.3982) - Stop(0.0351, 0.2734, 2.0384)
Start(0.1159, 1.5345, 3.5754) - Stop(0.0378, 0.4938, 3.2216)

threads: 4, iterations: 3, testStop: true
Start(0.2066, 0.3553, 0.4598) - Stop(0.0410, 0.1534, 0.4630)
Start(0.2769, 0.3740, 0.4994) - Stop(0.0414, 0.1028, 0.2581)
Start(0.2342, 0.3602, 0.5650) - Stop(0.0497, 0.2199, 0.3620)

threads: 4, iterations: 3, testStop: false
Start(0.1698, 0.2492, 0.3713)
Start(0.1473, 0.2477, 0.4103)
Start(0.1756, 0.2909, 0.4295)

threads: 1, iterations: 10, testStop: false
Start(0.1910, 0.1910, 0.1910)
Start(0.1685, 0.1685, 0.1685)
Start(0.1564, 0.1564, 0.1564)
Start(0.1504, 0.1504, 0.1504)
Start(0.1389, 0.1389, 0.1389)
Start(0.1234, 0.1234, 0.1234)
Start(0.1550, 0.1550, 0.1550)
Start(0.2800, 0.2800, 0.2800)
Start(0.1587, 0.1587, 0.1587)
Start(0.1877, 0.1877, 0.1877)

Source:

#include <windows.h>
#include <iostream>
#include <vector>
#include <chrono>
#include <iomanip>

using namespace std::chrono;

struct Test
{
    HANDLE Thread = { 0 };
    time_point<steady_clock> Creation;
    time_point<steady_clock> Started;
    time_point<steady_clock> Stopped;
};

DWORD __stdcall ThreadProc(void* lpParameter) {
    auto test = (Test*)lpParameter;
    test->Started = steady_clock::now();
    return 0;
}

DWORD __stdcall TestThreadsEnded(void* lpParameter) {
    auto& tests = *(std::vector<Test>*)lpParameter;

    std::size_t finished = 0;
    while (finished < tests.size())
    {
        for (auto& test : tests)
        {
            if (test.Thread != NULL && WaitForSingleObject(test.Thread, 0) == WAIT_OBJECT_0)
            {
                test.Stopped = steady_clock::now();
                test.Thread = NULL;
                finished++;
            }
        }
    }

    return 0;
}

duration<double, std::milli> diff(time_point<steady_clock> start, time_point<steady_clock> stop)
{
    return stop - start;
}

struct Stats
{
    double min;
    double average;
    double max;
};

Stats stats(const std::vector<double>& durations)
{
    Stats stats = { 1000, 0, 0 };

    for (auto& duration : durations)
    {
        stats.min = duration < stats.min ? duration : stats.min;
        stats.max = duration > stats.max ? duration : stats.max;
        stats.average += duration;
    }

    stats.average /= durations.size();

    return stats;
}

void TestScheduler(const int threadCount, const int iterations, const bool testStop)
{
    std::cout << "\nthreads: " << threadCount << ", iterations: " << iterations << ", testStop: " << (testStop ? "true" : "false") << "\n";

    for (auto i = 0; i < iterations; i++)
    {
        std::vector<Test> tests(threadCount);
        HANDLE testThreadsEnded = NULL;

        if (testStop)
        {
            testThreadsEnded = CreateThread(NULL, 0, TestThreadsEnded, (void*)& tests, 0, NULL);
        }

        for (auto& test : tests)
        {
            test.Creation = steady_clock::now();
            test.Thread = CreateThread(NULL, 0, ThreadProc, (void*)& test, 0, NULL);
        }

        if (testStop)
        {
            WaitForSingleObject(testThreadsEnded, INFINITE);
        }
        else
        {
            std::vector<HANDLE> threads;
            for (auto& test : tests) threads.push_back(test.Thread);
            WaitForMultipleObjects((DWORD)threads.size(), threads.data(), TRUE, INFINITE);
        }

        std::vector<double> startDurations;
        std::vector<double> stopDurations;
        for (auto& test : tests)
        {
            startDurations.push_back(diff(test.Creation, test.Started).count());
            stopDurations.push_back(diff(test.Started, test.Stopped).count());
        }

        auto startStats = stats(startDurations);
        auto stopStats = stats(stopDurations);

        std::cout << std::fixed << std::setprecision(4);
        std::cout << "Start(" << startStats.min << ", " << startStats.average << ", " << startStats.max << ")";
        if (testStop)
        {
            std::cout << " - ";
            std::cout << "Stop(" << stopStats.min << ", " << stopStats.average << ", " << stopStats.max << ")";
        }
        std::cout << "\n";
    }
}

int main(void)
{
    TestScheduler(100, 1, true);
    TestScheduler(32, 3, true);
    TestScheduler(4, 3, true);
    TestScheduler(4, 3, false);
    TestScheduler(1, 10, false);
    return 0;
}
gradbot