0

I work as the sole application developer within a database-focussed team. Recently, I've been trying to improve the efficiency of a process which my predecessor had prototyped. The best way to do this was to thread it. So this was my approach:

public void DoSomething()
{
    Parallel.ForEach(rowCollection, (fr) =>
    {
        fr.Result = MyCleaningOperation();
    });
}

Which functions fine, but causes errors. The errors are arising in a third-party tool the code is calling. This tool is supposed to be thread safe, but it looks strongly as though they're arising when two threads try and perform the same operation at the same time.

So I went back to the prototype. Previously I'd only looked at this to see how to talk to the third-party tool. But when I examined the called code, I discovered my predecessor had threaded it using Task and Action, operators with which I'm not familiar.

Action<object> MyCleaningOperation = (object obj) =>
{
    // invoke the third-party tool.
};

public void Main()
{
    Task[] taskCollection = new Task[rowCollection.Length];
    for (int i = 0; i < rowCollection.Length; i++)
    {
        taskCollection[i] = new Task(MyCleaningOperation, i);
    }

    foreach (var task in taskCollection)
    {
        task.Start();
    }

    try
    {
        Task.WaitAll(taskCollection);
    }
    catch (Exception ex)
    {
        throw ex;
    }
}

Now, that's not great code but it is a prototype. Allegedly his prototype did not error and ran at a greater speed than mine. I cannot verify this because his prototype was dependent on a dev database that no longer exists.

I don't particularly want to go on a wild goose chase of trying out different kinds of threading in my app to see if some throw errors or not - they're intermittent so it would be a long drawn out process. More so because having read about Task I cannot see any reason why it would work more effectively than Parallel. And because I'm using a void function I cannot easily add an await to mimic the prototype operation.

So: is there an operational difference between the two? Or any other reason why one might cause a tool to trip up with multiple threads using the same resource and the other not?

Bob Tway
  • Let me quote Jon Skeet: Parallel Extensions uses an appropriate number of cores, based on how many you physically have and how many are already busy. It allocates work for each core and then uses a technique called work stealing to let each thread process its own queue efficiently and only need to do any expensive cross-thread access when it really needs to. Source: http://stackoverflow.com/questions/1114317/does-parallel-foreach-limits-the-number-of-active-threads. Basically, the task approach will run them all in parallel, while Parallel.ForEach will limit the number of threads created. – MistyK Apr 07 '17 at 10:05
  • Are you asking, on the overview level, what the difference between a thread and a task is, or are you asking "the code in the question, if it were to use threads instead, what would the practical difference be?"? – Lasse V. Karlsen Apr 07 '17 at 10:26
  • If you're asking "why did my approach not work but his does" then that is quite impossible to answer. You could start with posting those errors your code caused, but even then it might be impossible. It could simply be that the task (thread) based code would also cause these errors but that it manages to stagger the starting of the tasks in such a way that it doesn't, whereas Parallel.ForEach more aggressively executes the code in parallel. – Lasse V. Karlsen Apr 07 '17 at 10:29
  • I can see nothing in the code (at least when skimming it) that indicates that the bigger example should work whereas the Parallel.ForEach should not. There is more to that specific problem than the code in the question. – Lasse V. Karlsen Apr 07 '17 at 10:30
  • @LasseV.Karlsen Thank you for your comments. I'd like both: to understand the difference and to know why one might work and the other not. I don't entirely trust the word of my predecessor: knowing him it's possible he didn't test it properly. The error message isn't relevant in this case - it's peculiar to the third party app. – Bob Tway Apr 07 '17 at 10:36
  • You should start with this: Is it at all possible to execute that 3rd party library/application in parallel with itself? – Lasse V. Karlsen Apr 07 '17 at 10:40
  • @LasseV.Karlsen According to the documentation, yes. The errors I'm getting back are intermittent. They get more frequent the more threads one invokes. They are also not consistent - i.e. one of several different errors can appear and crash the threaded application. – Bob Tway Apr 07 '17 at 10:44

3 Answers

3

Action<T> is a void-returning delegate which takes a T. It represents an operation which consumes a T, produces nothing, and is started when invoked.

Task<T> is what it says on the tin: it represents a job that is possibly not yet complete, and when it is complete, it provides a T to its completion.

So, let's make sure you've got it so far: what is the completion of a Task<T>?

Don't read on until you've sussed it out.

.

.

.

.

.

The completion of a task is an action. A task produces a T in the future; an action performs an action on that T when it is available.

All right, so then what is a Task, no T? A task that does not produce a value when it completes. What's the completion of a Task? Plainly an Action.

How can we describe the task performed by a Task then? It does something but produces no result. So that's an Action. Suppose the task requires that it consumes an object to do its work; then that's an Action<object>.

Make sure you understand the relationships here. They are a bit tricky but they all make sense. The names are carefully chosen.
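To make those relationships concrete, here is a minimal sketch. The class name, the variable names and the trivial workloads are invented for illustration. Note that ContinueWith hands the continuation the finished task itself rather than the bare T, so the Action<int> below is invoked with t.Result.

```csharp
using System;
using System.Threading.Tasks;

public static class TaskActionRelationships
{
    public static string Run()
    {
        string log = "";

        // The work done by a non-generic Task that consumes an object: an Action<object>.
        Action<object> work = state => { log += "worked on " + state + "; "; };

        // A Task with no T: a job that produces no value when it completes.
        Task job = new Task(work, 7);
        Task afterJob = job.ContinueWith(finished => { log += "job done; "; });
        job.Start();
        afterJob.Wait();

        // A Task<int>: a job that produces an int at some point in the future.
        Task<int> sum = Task.Run(() => 2 + 3);

        // Its completion consumes that int and produces nothing: an Action<int>.
        Action<int> useSum = value => { log += "sum is " + value; };
        Task afterSum = sum.ContinueWith(t => { useSum(t.Result); });
        afterSum.Wait();

        return log;
    }

    public static void Main()
    {
        Console.WriteLine(Run()); // prints "worked on 7; job done; sum is 5"
    }
}
```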

So what then is a thread? A thread is a worker that can do tasks. Do not confuse tasks with threads.

having read about Task I cannot see any reason why it would work more effectively than Parallel.

You see what I mean I hope. This sentence makes no sense. Tasks are just that: tasks. Deliver this book to this address. Add these numbers. Mow this lawn. Tasks are not workers, and they are certainly not the concept of "hire a bunch of workers to do tasks". Parallelism is a strategy for assigning workers to tasks.

Moreover, do not fall into the trap of believing that tasks are inherently parallel. There is no requirement that tasks be performed simultaneously by multiple workers; much of the work we've done in C# in the past few years has been to ensure that tasks may be performed efficiently by a single worker. If you need to make breakfast, mow the lawn and pick up the mail, you don't need to hire a staff to do those things, but you can still pick up the mail while the toast is toasting.
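The breakfast analogy can be sketched roughly as follows. The names and the 200 ms delay are made up, and in a console program the code after an await may resume on a thread-pool thread. The point is only that nobody is hired to stand next to the toaster: the wait is a timer, and the mail gets picked up in the meantime.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BreakfastDemo
{
    // Hypothetical toaster: the delay is a timer, not a thread that sits waiting.
    private static async Task ToastAsync(List<string> log)
    {
        log.Add("toast started");
        await Task.Delay(200);
        log.Add("toast done");
    }

    public static async Task<List<string>> MakeBreakfastAsync()
    {
        var log = new List<string>();
        Task toast = ToastAsync(log);   // runs until its first await, then returns here
        log.Add("picked up mail");      // done while the toast is toasting
        await toast;                    // come back once the toast is ready
        log.Add("ate breakfast");
        return log;
    }

    public static void Main()
    {
        List<string> log = MakeBreakfastAsync().GetAwaiter().GetResult();
        Console.WriteLine(string.Join(", ", log));
    }
}
```

In practice the log reads "toast started, picked up mail, toast done, ate breakfast": the mail is collected before the toast finishes, without a second worker being assigned to either job.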

You should examine carefully your claim that the best way to increase performance is to parallelize. Remember, parallelization is simply hiring as many workers as there are CPU cores to run them, and then handing out tasks to each. This is only an improvement if (1) the tasks can actually be run in parallel, independently, (2) the tasks are gated on CPU, not I/O, and (3) you can write programs that are correct in the face of multiple threads of execution in the same program.

If your tasks really are "embarrassingly parallel" and can run completely independently of each other then you might consider process parallelism rather than thread parallelism. It's safer and easier.

Eric Lippert
  • You said that "parallelization ... is only an improvement if ... tasks are gated by CPU, not I/O". Why is that? Say I need to make 10 web requests (I/O work) - if I parallelize them I will certainly get an improvement in time needed to complete them all, compared to sequential execution one by one. – Evk Apr 07 '17 at 10:20
  • @Evk: No, you get benefits by *running them asynchronously*. I need ten magazines. Which is more efficient: One: an asynchronous nonparallel workflow where I fill out ten subscription cards and send them off to the publishers, and then asynchronously receive each magazine as it arrives, or Two: I hire ten secretaries, each secretary fills out one subscription card, and then I pay them to sleep beside the mailbox waiting for their magazine to arrive? – Eric Lippert Apr 07 '17 at 10:23
  • @Evk: Thread parallelism gives you no benefit for I/O based tasks, only costs. But now consider another workflow: I have a thousand letters to answer. Now it *does* make sense to hire ten secretaries, and schedule those thousand tasks to those ten workers to do in parallel; odds are pretty good they will do it around ten times faster than I would alone. Use parallelism for CPU-bound tasks. – Eric Lippert Apr 07 '17 at 10:25
  • Yes I understand the difference, but I mean maybe term "parallelization" is a bit confusing here. You fill ten cards and send to publishers and then work is performed by those publishers in parallel still. In your program there are no 10 threads but work in some sense is still performed in parallel. So I mean reading this someone might think that it's better to perform IO work in sequential manner (execute - wait - execute - wait). – Evk Apr 07 '17 at 10:26
  • @Evk: I agree it is confusing. I just typed "parallel processing" into a search engine and got "a mode of operation in which a process is split into parts, which are executed simultaneously on different processors attached to the same computer." and "The simultaneous use of more than one CPU to execute a program." and "breaking up and running program tasks on multiple microprocessors" and so on; we generally use parallelism to describe workflows that are farmed out to *local CPU cores*, not to I/O devices. – Eric Lippert Apr 07 '17 at 10:30
2

The errors are arising in a third-party tool the code is calling. This tool is supposed to be thread safe, but it looks strongly as though they're arising when two threads try and perform the same operation at the same time.

If that's correct, then parallel tasks won't prevent errors any more than Parallel will.

But when I examined the called code, I discovered my predecessor had threaded it using Task and Action, operators with which I'm not familiar.

That code looks OK, though it does use the task constructor combined with Start, which would be more elegantly expressed with Task.Run.
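As a rough sketch of what the Task.Run version could look like: Clean below is a hypothetical stand-in for the call into the third-party tool, and the int rows are placeholders for whatever rowCollection actually holds.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

public static class TaskRunDemo
{
    // Hypothetical stand-in for the call into the third-party tool.
    public static int Clean(int row) => row * 2;

    public static int[] CleanAll(int[] rows)
    {
        // Task.Run creates and starts each task in one step,
        // so there is no separate loop calling Start().
        Task<int>[] tasks = rows.Select(row => Task.Run(() => Clean(row))).ToArray();

        // Block until every task has finished, as the prototype does.
        Task.WaitAll(tasks);

        // Results come back in the same order as the input rows.
        return tasks.Select(t => t.Result).ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", CleanAll(new[] { 1, 2, 3 }))); // prints "2, 4, 6"
    }
}
```

Building the array from the rows also means it is always the right size for the input.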

The prototype is using a dynamic task-based parallelism approach, which is overkill for this situation. Your code is using parallel loops, which is more appropriate for data parallelism (see Selecting the Right Pattern and Figure 1).

Allegedly his prototype did not error and ran at a greater speed than mine.

If the error is due to a multithreading bug in the third-party tool, then the prototype was just as susceptible to those errors. Perhaps it was using an earlier version of the tool, or the data in the dev database did not expose the bug, or it just got lucky.

Regarding performance, I would expect Parallel to have superior performance to plain task parallelism in general, because Parallel can "batch" operations among tasks, reducing the overhead. Though that extra logic does come with a cost, too, so for small data sizes it could be less performant.

IMO the bigger question is the correctness, and if it fails with Parallel, then it could just as easily fail with parallel tasks.
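One way to probe the correctness question, offered as a suggestion rather than a fix, is to cap how many calls Parallel.ForEach runs at once and see whether the errors track the cap. The sketch below assumes a hypothetical workload (a short sleep standing in for the third-party call) and simply records how many calls overlapped; ParallelOptions.MaxDegreeOfParallelism is the real knob.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledCleaningDemo
{
    // Runs a stand-in workload over the rows and returns the largest
    // number of calls that were in flight at the same moment.
    public static int MaxObservedConcurrency(int[] rows, int maxDegree)
    {
        object gate = new object();
        int running = 0;
        int peak = 0;

        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegree };

        Parallel.ForEach(rows, options, row =>
        {
            lock (gate)
            {
                running++;
                peak = Math.Max(peak, running);
            }

            Thread.Sleep(5); // hypothetical stand-in for the third-party call

            lock (gate)
            {
                running--;
            }
        });

        return peak;
    }

    public static void Main()
    {
        int[] rows = Enumerable.Range(0, 40).ToArray();
        Console.WriteLine(MaxObservedConcurrency(rows, 1)); // prints "1"
        Console.WriteLine(MaxObservedConcurrency(rows, 4)); // at most 4
    }
}
```

If the real errors disappear with a cap of 1 and return as the cap rises, that points at the tool's thread safety rather than at the choice between Parallel and tasks.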

Stephen Cleary
1

On the surface, the difference between a task and a thread is this:

  • A thread is one of the ways you can get the operating system and the processor to do more than one thing at a time. It is something that is scheduled on the processor and allowed to execute, potentially (and these days, most often) at the same time as other things, simply because today's processors can do more than one thing at once.
  • A task, in the context of Task or Task<T>, on the other hand, is the representation of something that has the potential of completing at some point in the future, and which then represents the result of that completion.

That's basically it.

Sure, you can wrap a thread in a task, but if your question is just "what is the difference between a thread and a task" then the above is it.

You can easily represent things that have nothing to do with threads or even parallel execution of code in a task and it would still be a task. Asynchronous I/O uses tasks heavily these days and most of those (at least the good implementations) don't use (extra) threads at all.
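A small sketch of a task with no thread behind it, using TaskCompletionSource<T>. The class name and the "magazine arrived" string are invented for the example. The task here is just a promise that some other piece of code completes later.

```csharp
using System;
using System.Threading.Tasks;

public static class TaskWithoutThreadDemo
{
    public static string Run()
    {
        // No delegate is scheduled anywhere: this task only represents a future result.
        var source = new TaskCompletionSource<string>();
        Task<string> task = source.Task;

        bool completedBefore = task.IsCompleted;

        // Whoever owns the source decides when the task completes.
        source.SetResult("magazine arrived");

        return completedBefore + " -> " + task.IsCompleted + ": " + task.Result;
    }

    public static void Main()
    {
        Console.WriteLine(Run()); // prints "False -> True: magazine arrived"
    }
}
```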

Lasse V. Karlsen