
I am parallelizing an application in C# and am testing the performance difference between implicit and explicit threading. Both techniques build on the .NET threading infrastructure (the System.Threading and System.Threading.Tasks namespaces): the implicit version uses a Parallel.For loop, while the explicit version creates, starts, and joins threads manually, which also involves calculating chunk sizes, calling the worker function, etc.

I have found that I achieve a better speedup over the original sequential version of the program with explicit threading (about 1.2x faster, averaged over 50 trials) on eight cores. I understand the underlying differences between these two techniques; however, I am not sure why the explicit version seems to be faster. I thought the implicit version might be faster, since its tasks are scheduled automatically, as opposed to manual task and thread creation. Is there a reason (apart from a possible error in my results) why the explicit version would be faster?

For reference, a summarized version of the relevant code can be seen below.

float[][] stft_implicit(Complex[] x, int wSamp)
{
    //...
    Parallel.For(0, size, new ParallelOptions { MaxDegreeOfParallelism = MainWindow.NUM_THREADS }, ii =>
    {
        Complex[] tempFFT = IterativeFFT.FFT(all_temps[ii], twiddles, wSamp);
        fft_results[ii] = tempFFT;
    });
    //...
}

float[][] stft_explicit(Complex[] x, int wSamp)
{
    //...
    length = (int)(2 * Math.Floor((double)N / (double)wSamp) - 1);
    chunk_size = (length + MainWindow.NUM_THREADS - 1) / MainWindow.NUM_THREADS;

    Thread[] threads = new Thread[MainWindow.NUM_THREADS];

    for (int i = 0; i < MainWindow.NUM_THREADS; i++)
    {
        threads[i] = new Thread(fft_worker);
        threads[i].Start(i);
    }

    for (int i = 0; i < MainWindow.NUM_THREADS; i++)
    {
        threads[i].Join();
    }
    //...
}

public void fft_worker(object thread_id)
{
    int ID = (int)thread_id;
    Complex[] temp = new Complex[wSamp];
    Complex[] tempFFT = new Complex[wSamp];
    int start = ID * chunk_size;
    int end = Math.Min(start + chunk_size, length);

    for (int ii = start; ii < end; ii++)
    {
        //...
        tempFFT = IterativeFFT.FFT(temp, twiddles, wSamp);
        //...
    }
}
John4
  • Since you know the difference between your approaches I doubt you are benchmarking wrong, but there is still a tiny chance of it; showing the benchmarking code (in particular the part where you warm up the thread pool) would remove that stupid and offensive concern. – Alexei Levenkov Oct 09 '20 at 00:50
  • Ideally you would run this in BenchmarkDotNet; then you could show your benchmarking class and the results. At the moment we can only take your word for it, and can't reliably test the results or prove that this is an apples-to-apples comparison. That said, all things being equal, an explanation of why you are getting certain results is likely to be unsatisfying. – TheGeneral Oct 09 '20 at 00:54
  • `however, I am not sure why the explicit version seems to be faster.` _Most likely_ because you forgot to call https://learn.microsoft.com/en-us/dotnet/api/system.threading.threadpool.setminthreads?view=netcore-3.1 (a sketch of this call is shown after these comments). If that number is < `NUM_THREADS` then threads from the thread pool will become available more slowly than if you create them directly, since the thread pool is conservative about spinning up new threads. – mjwills Oct 09 '20 at 00:56
  • `I thought that perhaps the implicit version would be faster as tasks would be scheduled automatically, as opposed to manual task and thread creation.` Automatic does not necessarily mean _faster_. It often means "better overall resource allocation", which might mean slower, but better reuse (for example). – mjwills Oct 09 '20 at 00:58
  • @mjwills I have voted to reopen the question because the [proposed as duplicate](https://stackoverflow.com/questions/6637932/why-sometimes-task-is-significantly-slower-than-thread) is about tasks vs threads, while this question is about `Parallel.For` vs manual partitioning and thread management. The answer to the other question is not a valid answer for this question, and vice versa. Duplicates may exist, but the proposed does not seem to be one of them. – Theodor Zoulias Oct 09 '20 at 01:14
  • When you provide "a summarized version of the relevant code" you don't allow us the option of running your code and testing for ourselves. You should provide a [mcve], including source data, and a summary of your benchmarking results. We can then verify that we are getting the same results (there might be an environmental issue affecting your results, but not ours), and then we can refactor the code either to ensure you're actually measuring what you think you are measuring, or to show alternatives that do perform well. – Enigmativity Oct 09 '20 at 01:38
  • @mjwills from a quick search I can't find any suitable duplicate. [This](https://stackoverflow.com/questions/13070339/parallel-for-vs-regular-threads) and [this](https://stackoverflow.com/questions/3572554/c-sharp-parallel-vs-threaded-code-performance) questions look superficially similar, but they have specific context and nuances that distinguish them from this question. Btw yes, the `Parallel.For` is based on the TPL, but AFAIK the performance difference observed by the OP relates to specifics of the `Parallel` API, and not to its underlying implementation. – Theodor Zoulias Oct 09 '20 at 01:40
  • @John4 Can you confirm whether `SetMinThreads` helped (or not) (as well as provide a [mcve])? – mjwills Oct 09 '20 at 01:48
  • Thank you all for your responses. A minimal reproducible example is not feasible due to the large scope of the overarching project, and I was just after an answer from a theoretical and educational point of view on the behaviour of implicit vs explicit threads generally, not specific to the technical aspects of the example reference code. I will accept the answer by @Theodor Zoulias and am happy for the question to be closed due to lack of clarity. – John4 Oct 09 '20 at 02:54
  • Did `SetMinThreads` help? – mjwills Oct 09 '20 at 03:29
  • @mjwills I wouldn't expect configuring the `ThreadPool` to have much effect, because the OP's workload seems purely CPU-bound. CPU-bound workloads generally do not benefit by throwing more threads to the pool, because the default size of the pool is enough for 100% CPU utilization. Increasing the threads makes sense when your workload is I/O-bound, and you are (unwillingly) blocking threads because the API you have to use is not asynchronous. – Theodor Zoulias Oct 09 '20 at 18:26
  • You may be right @TheodorZoulias. Disappointing that we lack a [mcve] to validate that. – mjwills Oct 09 '20 at 21:59
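
For reference, the ThreadPool.SetMinThreads warm-up that mjwills suggests above would look something like the following sketch. This is an illustration only, reusing the question's MainWindow.NUM_THREADS constant; it would run once at startup, before any timing:

// Sketch of the suggested warm-up (assumption: run once before benchmarking).
// Raising the pool's minimum makes NUM_THREADS workers available immediately,
// instead of being injected gradually during the measured run.
ThreadPool.GetMinThreads(out int workerMin, out int ioMin);
if (workerMin < MainWindow.NUM_THREADS)
{
    ThreadPool.SetMinThreads(MainWindow.NUM_THREADS, ioMin);
}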

1 Answer


I think that the comparison is not fair to the Parallel.For, because it has to invoke an anonymous lambda for each element of the processed array, while the explicit threading implementation involves just a single method invocation per thread (the fft_worker method). What makes this even more significant is that delegate invocations, such as calls to anonymous lambdas, cannot be inlined by the JIT compiler.

To restore the fairness of the comparison you can either:

  1. Include the overhead of an anonymous lambda invocation in the explicit threading implementation too:
for (int ii = start; ii < end; ii++)
{
    ((Action)(() =>
    {
        //...
        tempFFT = IterativeFFT.FFT(temp, twiddles, wSamp);
        //...
    }))();
}
  2. Reduce the granularity (in other words, increase the chunk size) of the implicit threading implementation, by replacing the Parallel.For with the Parallel.ForEach+Partitioner combo (Partitioner lives in the System.Collections.Concurrent namespace):
Parallel.ForEach(Partitioner.Create(0, size), range =>
{
    for (int ii = range.Item1; ii < range.Item2; ii++)
    {
        Complex[] tempFFT = IterativeFFT.FFT(all_temps[ii], twiddles, wSamp);
        fft_results[ii] = tempFFT;
    }
});

I haven't tested them, but either of these suggestions should narrow or eliminate the performance gap between the two parallelization techniques.
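As suggested in the comments, BenchmarkDotNet is the cleanest way to verify this. Below is a self-contained sketch that isolates just the per-element versus per-chunk scheduling overhead discussed above, using a cheap stand-in for the FFT work; the workload, sizes, and names here are illustrative assumptions rather than the question's actual code:

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class PartitioningBenchmarks
{
    private const int Size = 1_000_000; // assumed element count
    private double[] _results;

    [GlobalSetup]
    public void Setup() => _results = new double[Size];

    // Cheap stand-in for IterativeFFT.FFT, so that only the scheduling
    // overhead differs between the two benchmarks.
    private static double FakeWork(int i) => Math.Sqrt(i) * Math.PI;

    [Benchmark(Baseline = true)]
    public void PerElementLambda()
    {
        // One lambda invocation per element, as in the question's Parallel.For.
        Parallel.For(0, Size, i => _results[i] = FakeWork(i));
    }

    [Benchmark]
    public void ChunkedPartitioner()
    {
        // One lambda invocation per range, as in suggestion 2 above.
        Parallel.ForEach(Partitioner.Create(0, Size), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                _results[i] = FakeWork(i);
        });
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<PartitioningBenchmarks>();
}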

Theodor Zoulias