I'm playing around with async and await in C# in a simple console application. My goal is simple: to process a list of files asynchronously, so that the processing of one file does not block the processing of the others. None of the files depend on one another, and there are (let's say) thousands of files to go through.

Here is the code I have currently.

public class MyClass
{
    public void Go()
    {
        string[] fileSystemEntries = Directory.GetFileSystemEntries(@"Path\To\Files");

        Console.WriteLine("Starting to read from files!");
        foreach (var filePath in fileSystemEntries.OrderBy(s => s))
        {
            Task task = new Task(() => DoStuff(filePath));
            task.Start();
            task.Wait();
        }
    }

    private async void DoStuff(string filePath)
    {
        await Task.Run(() =>
        {
            Thread.Sleep(1000);
            string fileName = Path.GetFileName(filePath);
            string firstLineOfFile = File.ReadLines(filePath).First();
            Console.WriteLine("{0}: {1}", fileName, firstLineOfFile);
        });
    }
}

And my Main() method simply invokes this class:

public static class Program
{
    public static void Main()
    {
        var myClass = new MyClass();
        myClass.Go();
    }
}

There's some piece of this asynchronous programming pattern that I seem to be missing, though: whenever I run the program, the number of files actually processed seems random, anywhere from none of them to all six of them (in my example file set).

Basically, the main thread isn't waiting for all of the files to be processed, which I suppose is part of the point of asynchronously-running things, but I don't quite want that. All I want is: Process as many of these files in as many threads as you can, but still wait for them all to complete processing before finishing up.

Doctor Blue
  • You start and wait for each task in your foreach loop ... Create an array of tasks and use WaitAll. – David Brabant May 03 '14 at 13:11
  • There're a few conceptual issues with this code, but the major technical one is this: `new Task(() => DoStuff(filePath))`, where `DoStuff` is an `async void` method. You're doing a fire-and-forget call here, the tasks get completed before `DoStuff` methods have finished, and so does `myClass.Go()`. – noseratio May 03 '14 at 13:14
  • Processing items in parallel is a very basic concurrency topic. In fact your question is about the most basic scenario possible as far as I can tell. It's probably best if you do some research in that direction. You'll quickly find a solution. – usr May 04 '14 at 12:00
  • possible duplicate of [Parallel.ForEach vs Task.Factory.StartNew](http://stackoverflow.com/questions/5009181/parallel-foreach-vs-task-factory-startnew) – usr May 04 '14 at 12:01
  • I have arbitrarily picked one of the ~100 questions that would answer this and suggested it as a duplicate. – usr May 04 '14 at 12:01
  • @usr, I think the main point the OP is missing here is that he should be using async IO for that, without either `Parallel.ForEach` or `Task.Factory.StartNew` at all, and let the rest of the processing take place on an IOCP thread. Not that there are no duplicates for this, though. – noseratio May 04 '14 at 12:20
  • @Noseratio whatever style he picks, "C# process items in parallel" must return a wealth of information. – usr May 04 '14 at 12:32
  • @usr Funnily enough, my "async" searches (as well as SO's automatic searches when creating the question) weren't fruitful, but had I used "parallel" you probably are correct. – Doctor Blue May 04 '14 at 12:44
  • @Scott yeah since the async/await stuff has been introduced the concepts have become ambiguous and misleading to a beginner. I see async/await used all the time when synchronous threading would have been simpler and accomplished the same thing. – usr May 04 '14 at 12:46
  • @usr and I am using `Parallel.ForEach` now, but I am unsure whether to: Allow this question to be marked as duplicate, edit my self-answer below, create a second self-answer, or mark my answer as community wiki. – Doctor Blue May 04 '14 at 13:07
  • This question is distinct enough now to coexist with the existing material. If you feel like it, add to your existing answer. The point being that this question becomes more useful to future visitors. Accept the answer that you think will be most helpful to others. – usr May 04 '14 at 13:11

2 Answers

One of the major design goals behind async/await was to facilitate the use of naturally asynchronous I/O APIs. In this light, your code might be rewritten like this (untested):

public class MyClass
{
    private int filesRead = 0;

    public void Go()
    {
        GoAsync().Wait();
    }

    private async Task GoAsync()
    {
        string[] fileSystemEntries = Directory.GetFileSystemEntries(@"Path\To\Files");

        Console.WriteLine("Starting to read from files! Count: {0}", fileSystemEntries.Length);

        var tasks = fileSystemEntries.OrderBy(s => s).Select(
            fileName => DoStuffAsync(fileName));
        await Task.WhenAll(tasks.ToArray());

        Console.WriteLine("Finish! Read {0} file(s).", filesRead);
    }

    private async Task DoStuffAsync(string filePath)
    {
        string fileName = Path.GetFileName(filePath);
        using (var reader = new StreamReader(filePath))
        {
            string firstLineOfFile = 
                await reader.ReadLineAsync().ConfigureAwait(false);
            Console.WriteLine("[{0}] {1}: {2}", Thread.CurrentThread.ManagedThreadId, fileName, firstLineOfFile);
            Interlocked.Increment(ref filesRead);
        }
    }
}

Note that it doesn't spawn any new threads explicitly, but that may be happening behind the scenes with await reader.ReadLineAsync().ConfigureAwait(false).
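Whether ReadLineAsync actually performs asynchronous I/O depends on how the underlying stream was opened. As far as I know, a StreamReader constructed from just a path does not open the file with the asynchronous option, so the following is a minimal sketch of one way to request it explicitly. The class and method names (AsyncFirstLineReader, ReadFirstLineAsync, ReadFirstLinesAsync) are hypothetical and the buffer size of 4096 is an arbitrary assumption:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original answer.
public static class AsyncFirstLineReader
{
    // Reads the first line of one file through a stream opened for asynchronous I/O.
    public static async Task<string> ReadFirstLineAsync(string filePath)
    {
        using (var stream = new FileStream(
            filePath, FileMode.Open, FileAccess.Read, FileShare.Read,
            bufferSize: 4096, useAsync: true)) // useAsync requests overlapped I/O where supported
        using (var reader = new StreamReader(stream))
        {
            return await reader.ReadLineAsync().ConfigureAwait(false);
        }
    }

    // Starts one read per file; the returned task completes when all reads have finished.
    // Results come back in the same order as the input paths.
    public static Task<string[]> ReadFirstLinesAsync(string[] filePaths)
    {
        return Task.WhenAll(filePaths.Select(path => ReadFirstLineAsync(path)));
    }
}
```

From a synchronous Main, something like AsyncFirstLineReader.ReadFirstLinesAsync(paths).GetAwaiter().GetResult() would block until every file has been read. With thousands of files, this opens them all at once, so some form of throttling may be worth considering.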

Doctor Blue
noseratio

I combined the suggestions from the comments above to reach my solution. I didn't need the async or await keywords at all: I only had to create a list of tasks, start them all, and then call Task.WaitAll. Here is the resulting code:

public class MyClass
{
    private int filesRead = 0;

    public void Go()
    {
        string[] fileSystemEntries = Directory.GetFileSystemEntries(@"Path\To\Files");

        Console.WriteLine("Starting to read from files! Count: {0}", fileSystemEntries.Length);
        List<Task> tasks = new List<Task>();
        foreach (var filePath in fileSystemEntries.OrderBy(s => s))
        {
            Task task = Task.Run(() => DoStuff(filePath));
            tasks.Add(task);
        }
        Task.WaitAll(tasks.ToArray());
        Console.WriteLine("Finish! Read {0} file(s).", filesRead);
    }

    private void DoStuff(string filePath)
    {
        string fileName = Path.GetFileName(filePath);
        string firstLineOfFile = File.ReadLines(filePath).First();
        Console.WriteLine("[{0}] {1}: {2}", Thread.CurrentThread.ManagedThreadId, fileName, firstLineOfFile);
        Interlocked.Increment(ref filesRead); // atomic: DoStuff runs on several threads at once
    }
}

When testing, I added Thread.Sleep calls as well as busy loops to peg the CPUs on my machine. In Task Manager, I observed all of the cores being pegged during the busy loops, and every time I ran the program the files were processed in a different order (a good sign, since it indicates the files are being processed concurrently rather than one after another).

Every time I ran the program, fileSystemEntries.Length matched filesRead.
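One caveat about counters shared between tasks: a plain ++ on a shared field is a read-modify-write and is not atomic, so matching counts on a small file set do not prove correctness for thousands of files. The sketch below contrasts the two approaches using a hypothetical CounterDemo class (the names are mine, not from the code above):

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical demo class for illustration only.
public static class CounterDemo
{
    // Increments a shared counter atomically from parallel iterations.
    // The result always equals the number of iterations.
    public static int CountWithInterlocked(int iterations)
    {
        int counter = 0;
        Parallel.For(0, iterations, _ => Interlocked.Increment(ref counter));
        return counter;
    }

    // Same loop with a plain ++. Concurrent iterations can overwrite each
    // other's update, so the result may be lower than the number of iterations.
    public static int CountWithPlainIncrement(int iterations)
    {
        int counter = 0;
        Parallel.For(0, iterations, _ => counter++);
        return counter;
    }
}
```

With a large iteration count, CountWithPlainIncrement may return less than the requested number on a multi-core machine, although it can also happen to return the right value on any given run.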

EDIT: Based on the comment discussion above, I've found a cleaner (and, based on the linked question in the comments, more efficient) solution is to use Parallel.ForEach:

public class MyClass
{
    private int filesRead;

    public void Go()
    {
        string[] fileSystemEntries = Directory.GetFileSystemEntries(@"Path\To\Files");

        Console.WriteLine("Starting to read from files! Count: {0}", fileSystemEntries.Length);
        Parallel.ForEach(fileSystemEntries, DoStuff);
        Console.WriteLine("Finish! Read {0} file(s).", filesRead);
    }

    private void DoStuff(string filePath)
    {
        string fileName = Path.GetFileName(filePath);
        string firstLineOfFile = File.ReadLines(filePath).First();
        Console.WriteLine("[{0}] {1}: {2}", Thread.CurrentThread.ManagedThreadId, fileName, firstLineOfFile);
        Interlocked.Increment(ref filesRead); // atomic: Parallel.ForEach calls DoStuff from several threads
    }
}

It seems there are many ways to approach asynchronous programming in C# now. Between Parallel, Task, and async/await, there's a lot to choose from. Based upon this thread, it looks like the best solution for me is Parallel, as it provides the cleanest solution, is more efficient than manually creating Task objects myself, and does not clutter the code with async and await keywords while achieving similar results.
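If the work is dominated by disk access rather than CPU, it may be worth capping how many files are read at once. Here is a rough sketch of how that could look with ParallelOptions, which also collects the first lines in a thread-safe dictionary instead of writing to the console. The class name ParallelFirstLines, the method ReadFirstLines, and its maxDegreeOfParallelism parameter are hypothetical, and the right cap is something to measure rather than assume:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper for illustration only.
public static class ParallelFirstLines
{
    // Reads the first line of each file in parallel, running at most
    // maxDegreeOfParallelism iterations at a time. Blocks until all files are done.
    public static ConcurrentDictionary<string, string> ReadFirstLines(
        IEnumerable<string> filePaths, int maxDegreeOfParallelism)
    {
        var results = new ConcurrentDictionary<string, string>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxDegreeOfParallelism };

        Parallel.ForEach(filePaths, options, filePath =>
        {
            // FirstOrDefault yields null for an empty file instead of throwing.
            results[filePath] = File.ReadLines(filePath).FirstOrDefault();
        });

        return results;
    }
}
```

Because Parallel.ForEach only returns once every iteration has finished, the dictionary is complete when ReadFirstLines returns, and its Count can stand in for a separate filesRead counter.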

Doctor Blue