
Consider the following code snippet and notice the difference in total runtime between setting numberTasksToSpinOff equal to 1 versus 3, 4, or more (depending on the thread resources on your machine). I see much longer run times when spinning off more tasks.

I deliberately passed the same data collection into each worker instance, so all worker tasks read from it at the same time. My understanding was that tasks can access a shared data structure without blocking as long as the operations are only reads or enumerations.

My goal is to spin off multiple tasks that iterate over the same shared data structure via read operations and all complete at around the same time, regardless of the number of tasks spun off.

Edit: Please see the second code snippet, where I use Parallel.ForEach() and give each worker its own dataset, so different tasks/threads no longer access the same data structure. Yet I still see an unacceptable amount of overhead.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine($"Entry Main Function Thread Id: {Thread.CurrentThread.ManagedThreadId}");

        //run
        var task = Task.Run(async () =>
        {
            Console.WriteLine($"Entry RunMe Task Thread Id: {Thread.CurrentThread.ManagedThreadId}");
            await RunMe();

            Console.WriteLine($"Exit RunMe Task Thread Id: {Thread.CurrentThread.ManagedThreadId}");
        });

        task.Wait();

        Console.WriteLine($"Exit Main Function Thread Id: {Thread.CurrentThread.ManagedThreadId}");

        Console.WriteLine("Press key to quit");
        Console.ReadLine();
    }

    private static async Task RunMe()
    {
        var watch = new Stopwatch();
        var numberTasksToSpinOff = 6;
        var numberItems = 20000;
        var random = new Random((int)DateTime.Now.Ticks);
        var dataPoints = Enumerable.Range(1, numberItems).Select(x => random.NextDouble()).ToList();
        var tasks = new List<Task>();
        var workers = new List<Worker>();

        //structure workers
        for (int i = 1; i <= numberTasksToSpinOff; i++)
        {
            workers.Add(new Worker(i, dataPoints));
        }

        //start timer
        watch.Restart();

        //spin off tasks
        foreach (var worker in workers)
        {
            tasks.Add(Task.Run(() =>
            {
                Console.WriteLine($"Entry WorkerId: {worker.WorkerId} -> new task spun off within Thread Id: {Thread.CurrentThread.ManagedThreadId}");
                worker.DoSomeWork();
                Console.WriteLine($"Exit WorkerId: {worker.WorkerId} -> new task spun off within Thread Id: {Thread.CurrentThread.ManagedThreadId}");
            }));

        }

        //completion tasks
        await Task.WhenAll(tasks);

        //stop timer
        watch.Stop();

        Console.WriteLine($"Time it took to complete in Milliseconds: {watch.ElapsedMilliseconds}");
    }
}

public class Worker
{
    public int WorkerId { get; set; }
    private List<double> _data;

    public Worker(int workerId, List<double> data)
    {
        WorkerId = workerId;
        _data = data;
    }

    public void DoSomeWork()
    {
        var indexPos = 0;

        foreach (var dp in _data)
        {
            // copies the remaining tail of the list on every iteration -> O(n^2) allocations per worker
            var subSet = _data.Skip(indexPos).Take(_data.Count - indexPos).ToList();
            indexPos++;
        }
    }
}

Second Code Snippet:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        var watch = new Stopwatch();
        var numberTasksToSpinOff = 1;
        var numberItems = 20000;
        //var random = new Random((int)DateTime.Now.Ticks);
        //var dataPoints = Enumerable.Range(1, numberItems).Select(x => random.NextDouble()).ToList();
        var workers = new List<Worker>();

        //structure workers
        for (int i = 1; i <= numberTasksToSpinOff; i++)
        {
            workers.Add(new Worker(i));
        }

        //start timer
        watch.Restart();

        //parallel work

        if (workers.Any())
        {
            var processorCount = Environment.ProcessorCount;
            var parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = processorCount };
            Parallel.ForEach(workers, parallelOptions, DoSomeWork);
        }

        //stop timer
        watch.Stop();
        Console.WriteLine($"Time it took to complete in Milliseconds: {watch.ElapsedMilliseconds}");

        Console.WriteLine("Press key to quit");
        Console.ReadLine();
    }

    private static void DoSomeWork(Worker worker)
    {
        Console.WriteLine($"WorkerId: {worker.WorkerId} -> new task spun off within Thread Id: {Thread.CurrentThread.ManagedThreadId}");

        var indexPos = 0;

        foreach (var dp in worker.Data)
        {
            // same O(n^2) tail-copying pattern as in the first snippet
            var subSet = worker.Data.Skip(indexPos).Take(worker.Data.Count - indexPos).ToList();
            indexPos++;
        }
    }
}

public class Worker
{
    public int WorkerId { get; set; }
    public List<double> Data { get; set; }

    public Worker(int workerId)
    {
        WorkerId = workerId;

        var numberItems = 20000;
        var random = new Random((int)DateTime.Now.Ticks);
        Data = Enumerable.Range(1, numberItems).Select(x => random.NextDouble()).ToList();
    }
}
Matt
  • Did you do _anything_ at all to debug this yet? There's no evidence in your question that you've made any attempt at all to explain your observations. Use a profiler, figure out where all the time is being spent. I'll bet you find it's the garbage collector, because the main thing your tasks do is create a _lot_ of garbage. – Peter Duniho Feb 06 '18 at 07:38
  • @PeterDuniho, yes I have debugged the code, I even printed out thread IDs in code. And you are incorrect: even when I just iterate the data structure without creating any additional data, I see the same explosion in total execution time. I find your "close vote" going overboard in this case without first clarifying. If I had figured it all out I would not have posted a question. I thought this is exactly what this site is for. Not everyone may possess the same knowledge at this point in time as you do, hence the asking for advice. – Matt Feb 06 '18 at 07:44
  • This seems like a reasonable question with some reasonable debate to me... not sure why it's being closed, especially if a solution might be found? – Mark Redman Feb 06 '18 at 07:47

2 Answers


NOTE: The following answer is based on testing and observation, not definitive knowledge.

The more tasks you spin off, the more overhead you generate, and thus the total execution time rises. BUT if you look at it from another viewpoint, you will see that the number of actually processed "data points" per second increases with the number of tasks (up until you reach the limit of available hardware threads):

The following values were generated on my machine (4C/8T) with 10,000 points per list:

  • 1 worker -> 1891 ms -> 5288 p/s
  • 2 workers -> 1921 ms -> 10411 p/s
  • 4 workers -> 2670 ms -> 14981 p/s
  • 8 workers -> 4871 ms -> 16423 p/s
  • 12 workers -> 7449 ms -> 16109 p/s

There you see that until I reach my "core limit" the processed data increases significantly; then, up to my "thread limit", it still increases noticeably; but after that it decreases again, because of the increased overhead and the lack of additional hardware resources.
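One way to check whether that overhead is allocation-driven: the `ToList` call inside `DoSomeWork` copies the remaining tail of the list on every iteration, so each worker allocates on the order of n²/2 doubles, and parallel workers then compete for the garbage collector. Here is a minimal standalone sketch (not the original code; `AllocationDemo` and both method names are illustrative) that contrasts the copying loop with an index-based, allocation-free one and counts Gen0 collections around each:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class AllocationDemo
{
    // Pattern from the question: ToList() copies the remaining tail of the
    // list on every iteration (~n^2/2 doubles allocated in total).
    public static void WithAllocations(List<double> data)
    {
        var indexPos = 0;
        foreach (var dp in data)
        {
            var subSet = data.Skip(indexPos).Take(data.Count - indexPos).ToList();
            indexPos++;
        }
    }

    // Allocation-free variant: walk the tail by index instead of copying it.
    public static double WithoutAllocations(List<double> data)
    {
        double sum = 0;
        for (int start = 0; start < data.Count; start++)
            for (int i = start; i < data.Count; i++)
                sum += data[i]; // touch each element of the "subset" in place
        return sum;
    }

    static void Main()
    {
        var data = Enumerable.Range(1, 2000).Select(x => (double)x).ToList();

        var before = GC.CollectionCount(0);
        WithAllocations(data);
        Console.WriteLine($"Gen0 collections (copying):  {GC.CollectionCount(0) - before}");

        before = GC.CollectionCount(0);
        WithoutAllocations(data);
        Console.WriteLine($"Gen0 collections (in place): {GC.CollectionCount(0) - before}");
    }
}
```

If the copying variant triggers many Gen0 collections on your machine, running several such workers in parallel multiplies the allocation rate, which would match the slowdown described in the question.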

Christoph Fink
  • I agree with all your observations above, but as long as the number of hardware threads does not constrain the number of workers that run in parallel, your reasoning does not hold. I run on a dual-Xeon machine with 24 cores and 48 hyper-threads. Hardware specs in this case definitely do not constrain a job with 10 workers, yet total execution time still ends up being a multiple of a job with 1 or 2 workers. Spinning off more tasks/threads costs a little, but not that much. Something else seems to be going on. – Matt Feb 06 '18 at 07:35
  • @MattWolf I did run the code through my profiler, and all the increased time does come from the `ToList` call. It gets much slower with more threads in parallel... – Christoph Fink Feb 06 '18 at 08:15
  • ...because it allocates space that is then garbage collected, as @Peter Duniho hinted at? I am profiling right now, using `GC.TryStartNoGCRegion` and `GC.EndNoGCRegion` – Matt Feb 06 '18 at 08:22
  • I am still stuck on this; preventing the garbage collector from working in the critical region did not yield any benefits at all. But obviously some other resource management by the CLR is still interfering, possibly memory allocation for the created objects? – Matt Feb 06 '18 at 08:32
  • @MattWolf Does your production-code also use `ToList`? Maybe you can avoid using that to speed it up? – Christoph Fink Feb 06 '18 at 08:36
  • You were right, GC was the main issue here. Just wanted to provide feedback on your answer and mark it as correct (housekeeping ;-) – Matt Aug 23 '18 at 05:03

Have you had a look at Parallel Tasks? You could then do something like this:

if (workers.Any())
{
    var parallelOptions = new ParallelOptions {MaxDegreeOfParallelism = Environment.ProcessorCount};
    Parallel.ForEach(workers, parallelOptions, DoSomeWork);
}

private static void DoSomeWork(Worker worker)
{
}
Mark Redman
  • Caveat: ideally only if the workloads aren't IO-bound operations, i.e. things that use IO completion ports. – TheGeneral Feb 06 '18 at 07:06
  • As far as I know, this becomes useful when there is a big work load for each operation, right? – Ferus7 Feb 06 '18 at 07:06
  • I have used this specifically for long running file IO tasks and works well. – Mark Redman Feb 06 '18 at 07:07
  • @MarkRedman, I will check it out, but what causes the additional workload in my original code? Am I using async/await or tasks incorrectly? I am asking because later on I want to be able to add new workers/tasks dynamically during runtime, without knowing beforehand how many workers/tasks I have at hand at any given point in time. – Matt Feb 06 '18 at 07:08
  • The way I understand it, using async/await as you are with WhenAll makes use of the resources on the same thread, whereas Parallel will do the work in individual threads. – Mark Redman Feb 06 '18 at 07:11
  • @MarkRedman No, Matt's implementation ends up being about the same as when using `Parallel.ForEach`. Also the timings are exactly the same (I did just test both variants and they have exactly the same behaviour)... – Christoph Fink Feb 06 '18 at 07:13
  • @MarkRedman, Ok but as you can see when you run my code, each task runs on its own thread already. – Matt Feb 06 '18 at 07:13
  • @ChrFin, that is what I am getting just now as well. Could it be that multiple enumerators over the shared data collection somehow block? – Matt Feb 06 '18 at 07:14
  • I see, yes looks like it will do the same job. – Mark Redman Feb 06 '18 at 07:15
  • @MattWolf This was also my first guess, but no: I did also run a test with a new collection for each worker and that gave me the same result... – Christoph Fink Feb 06 '18 at 07:16
  • Re running additional tasks at runtime: I have this code in a Windows service that runs periodically, getting the list (i.e. the list of workers) and then running the parallel tasks. By adjusting the timing and degrees of parallelism you can tune it nicely... in the Windows service I usually run a timer, stop the timer, run the code, then start the timer again so nothing overruns. – Mark Redman Feb 06 '18 at 07:22
  • @ChrFin, thanks, I was about to try that as well, just did not think about it from the outset as my requirement is a shared collection due to memory limitations. – Matt Feb 06 '18 at 07:22
  • I am not sure if you can share the data without some kind of synchronisation, or queuing the work (no parallel tasks) – Mark Redman Feb 06 '18 at 07:24
  • @MarkRedman, thanks for trying that, though that does unfortunately not answer the question or issue at hand. Also, I need to run this code in a class library, not as a Windows service. Also, please see ChrFin's comment, it does not make a difference whether I operate over a shared data collection or have each worker create its own resource. – Matt Feb 06 '18 at 07:24
  • Sure, the service is just used to repeat the process ongoing, could be anything else triggering it. – Mark Redman Feb 06 '18 at 07:24
  • can you not just pass in the data each time? worth a try? – Mark Redman Feb 06 '18 at 07:26
  • @MarkRedman, yes I have done that, just as ChrFin suggested, no luck. Each worker generates its own dataset. Runtime still more than doubles between 1 worker and 5 workers. – Matt Feb 06 '18 at 07:29
  • Depending on the data and required processing time you will probably need to find the right balance/compromise; it seems the code is simple enough to not have major issues. So in terms of your question, I think you are doing it properly, but you could simplify it using Parallel. – Mark Redman Feb 06 '18 at 07:30
  • @MarkRedman, I like your approach with `Parallel.ForEach()` as it simplifies code yet the issue here is not one of "balance/compromise" as I have 24 cores available and running 10 tasks/threads in parallel is definitely not constrained by hardware/system resources. – Matt Feb 06 '18 at 07:38
  • You know what, I am still convinced that you are waiting for tasks on the same thread and that I was correct before. – Mark Redman Feb 06 '18 at 07:40
  • see https://stackoverflow.com/questions/32726294/is-async-await-using-task-run-starting-a-new-thread-asynchronously – Mark Redman Feb 06 '18 at 07:42
  • @MarkRedman, as I pointed out I tested with your implementation as well. I can clearly see if work is queued up due to limitations in available threads. This is not the case here. Please see my edit and second code snippet. – Matt Feb 06 '18 at 07:47
  • @MarkRedman Yes, he is *waiting* on the same thread (which is *not* blocked, because of async/await), but the *processing* happens in a separate thread for each worker. And as already said and tested, it does not make any difference in terms of execution time if changed to `Parallel.ForEach`... – Christoph Fink Feb 06 '18 at 07:50
  • OK, then I am not sure what is going on, sorry. I assume the task you are running, which looks like some test code, is successful and not overflowing or erroring etc. Is that code representative of what the final task is? – Mark Redman Feb 06 '18 at 07:52