Given an input text file containing URLs, I would like to download the corresponding files all at once. I used the answer to the question "UserState using WebClient and TaskAsync download from Async CTP" as a reference.

public void Run()
{
    List<string> urls = File.ReadAllLines(@"c:/temp/Input/input.txt").ToList();

    int index = 0;
    Task[] tasks = new Task[urls.Count()];
    foreach (string url in urls)
    {
        WebClient wc = new WebClient();
        string path = string.Format("{0}image-{1}.jpg", @"c:/temp/Output/", index+1);
        Task downloadTask = wc.DownloadFileTaskAsync(new Uri(url), path);
        Task outputTask = downloadTask.ContinueWith(t => Output(path));
        tasks[index] = outputTask;
        index++;
    }
    Console.WriteLine("Start now");
    Task.WhenAll(tasks);
    Console.WriteLine("Done");

}

public void Output(string path)
{
    Console.WriteLine(path);
}

I expected the downloads to begin at the call to Task.WhenAll(tasks). Instead, the output looks like this:

c:/temp/Output/image-2.jpg
c:/temp/Output/image-1.jpg
c:/temp/Output/image-4.jpg
c:/temp/Output/image-6.jpg
c:/temp/Output/image-3.jpg
[many lines deleted]
Start now
c:/temp/Output/image-18.jpg
c:/temp/Output/image-19.jpg
c:/temp/Output/image-20.jpg
c:/temp/Output/image-21.jpg
c:/temp/Output/image-23.jpg
[many lines deleted]
Done

Why does the downloading begin before WaitAll is called? What can I change to achieve what I would like (i.e. all tasks will begin at the same time)?

Thanks

Dom

3 Answers

Why does the downloading begin before WaitAll is called?

First of all, you're not calling Task.WaitAll, which blocks synchronously; you're calling Task.WhenAll, which returns a Task that should be awaited.

As others have said, calling an async method starts the asynchronous operation immediately, even if you never await it, because any method conforming to the Task-based Asynchronous Pattern (TAP) returns a "hot" (already started) task.
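To illustrate the "hot task" behavior without touching the network, here is a minimal sketch. FakeDownloadAsync is a hypothetical stand-in for DownloadFileTaskAsync (it only logs and delays), and the class and method names are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class HotTaskDemo
{
    // Hypothetical stand-in for DownloadFileTaskAsync: records that it was
    // started, then completes after a short delay.
    private static async Task FakeDownloadAsync(int id, List<string> log)
    {
        log.Add($"started {id}");   // runs synchronously, inside the call itself
        await Task.Delay(50);       // the simulated download
    }

    public static async Task<List<string>> RunAsync()
    {
        var log = new List<string>();

        // Calling the method is what starts the work; no await is needed for that.
        Task[] tasks = Enumerable.Range(1, 3)
            .Select(id => FakeDownloadAsync(id, log))
            .ToArray();             // ToArray forces all three calls to happen here

        log.Add("Start now");       // logged after the three "started" entries
        await Task.WhenAll(tasks);  // only waits for completion; it starts nothing
        log.Add("Done");
        return log;
    }

    public static void Main()
    {
        foreach (string line in RunAsync().GetAwaiter().GetResult())
            Console.WriteLine(line);
    }
}
```

The log reads "started 1", "started 2", "started 3", "Start now", "Done", which mirrors the ordering seen in the question: the work is already under way by the time "Start now" is printed.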

What can I change to achieve what I would like (i.e. all tasks will begin at the same time)?

If you want to defer execution until Task.WhenAll, you can use Enumerable.Select to project each URL to a Task. Because Select is lazily evaluated, the downloads only start when Task.WhenAll enumerates the sequence:

public async Task RunAsync()
{
    IEnumerable<string> urls = File.ReadAllLines(@"c:/temp/Input/input.txt");

    // Select is lazy: no download starts until Task.WhenAll enumerates urlTasks.
    var urlTasks = urls.Select(async (url, index) =>
    {
        using (WebClient wc = new WebClient())
        {
            string path = string.Format("{0}image-{1}.jpg", @"c:/temp/Output/", index + 1);

            await wc.DownloadFileTaskAsync(new Uri(url), path);
            Output(path); // report the file only after its download has completed
        }
    });

    Console.WriteLine("Start now");
    await Task.WhenAll(urlTasks);
    Console.WriteLine("Done");
}
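
The deferral depends on Select being lazy, which is easy to check in isolation. The following sketch again uses a hypothetical FakeDownloadAsync in place of a real download; all names in it are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class DeferredStartDemo
{
    // Hypothetical stand-in for a real download method.
    private static async Task FakeDownloadAsync(int id, List<string> log)
    {
        log.Add($"started {id}");   // runs synchronously when the method is called
        await Task.Delay(50);       // the simulated download
    }

    public static async Task<List<string>> RunAsync()
    {
        var log = new List<string>();

        // No ToArray/ToList here, so the lambda has not been invoked yet.
        IEnumerable<Task> lazyTasks = Enumerable.Range(1, 3)
            .Select(id => FakeDownloadAsync(id, log));

        log.Add("Start now");
        await Task.WhenAll(lazyTasks);  // enumerates the sequence, starting each task
        log.Add("Done");
        return log;
    }

    public static void Main()
    {
        foreach (string line in RunAsync().GetAwaiter().GetResult())
            Console.WriteLine(line);
    }
}
```

Here "Start now" is logged first, followed by the three "started" entries and "Done". One caveat of this approach: enumerating the lazy sequence a second time (for example by calling Count() on it) invokes the lambda again and would start a second set of downloads.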
Yuval Itzchakov

Why does the downloading begin before WaitAll is called?

Because:

Tasks created by its public constructors are referred to as “cold” tasks, in that they begin their life cycle in the non-scheduled TaskStatus.Created state, and it’s not until Start is called on these instances that they progress to being scheduled. All other tasks begin their life cycle in a “hot” state, meaning that the asynchronous operations they represent have already been initiated and their TaskStatus is an enumeration value other than Created. All tasks returned from TAP methods must be “hot.”

Since DownloadFileTaskAsync is a TAP method, it returns a "hot" (that is, already started) task.
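
The difference between the two states can be observed through Task.Status. This is a small sketch, with Task.Delay standing in for a TAP method such as DownloadFileTaskAsync; the class name is made up for the example.

```csharp
using System;
using System.Threading.Tasks;

public static class ColdVsHotDemo
{
    public static void Main()
    {
        // Cold: built with the public constructor, not yet scheduled.
        var cold = new Task(() => Console.WriteLine("cold task body ran"));
        Console.WriteLine(cold.Status);                        // Created

        // Hot: returned by a TAP method, already under way when we receive it.
        Task hot = Task.Delay(100);
        Console.WriteLine(hot.Status != TaskStatus.Created);   // True

        cold.Start();                 // only now is the cold task scheduled
        Task.WaitAll(cold, hot);
        Console.WriteLine(cold.Status);                        // RanToCompletion
    }
}
```

A cold task does nothing until Start is called, whereas the task returned by Task.Delay never passes through the Created state at all.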

What can I change to achieve what I would like (i.e. all tasks will begin at the same time)?

I'd look at TPL Dataflow. Something like this (I've used HttpClient instead of WebClient, but that doesn't really matter):

    static async Task DownloadData(IEnumerable<string> urls)
    {
        // allow up to one concurrent operation per processor core
        var executionOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };

        // this block receives a URL and downloads the content it points to
        var downloadBlock = new TransformBlock<string, Tuple<string, string>>(async url =>
        {
            using (var client = new HttpClient())
            {
                var content = await client.GetStringAsync(url);
                return Tuple.Create(url, content);
            }
        }, executionOptions);

        // this block prints the length of the downloaded content
        var outputBlock = new ActionBlock<Tuple<string, string>>(tuple =>
        {
            Console.WriteLine($"Downloaded {(string.IsNullOrEmpty(tuple.Item2) ? 0 : tuple.Item2.Length)} bytes from {tuple.Item1}");
        }, executionOptions);

        // link downloadBlock to outputBlock: every item that downloadBlock
        // produces is passed on to outputBlock
        using (downloadBlock.LinkTo(outputBlock))
        {
            // fill downloadBlock with input data
            foreach (var url in urls)
            {
                await downloadBlock.SendAsync(url);
            }

            // tell downloadBlock that no more input is coming
            downloadBlock.Complete();
            // wait until all downloads have finished
            await downloadBlock.Completion;
            // tell outputBlock that no more input is coming
            outputBlock.Complete();
            // wait until all output has been printed
            await outputBlock.Completion;
        }
    }

    static void Main(string[] args)
    {
        var urls = new[]
        {
            "http://www.microsoft.com",
            "http://www.google.com",
            "http://stackoverflow.com",
            "http://www.amazon.com",
            "http://www.asp.net"
        };

        Console.WriteLine("Start now.");
        DownloadData(urls).Wait();
        Console.WriteLine("Done.");

        Console.ReadLine();
    }

Output:

Start now.
Downloaded 1020 bytes from http://www.microsoft.com
Downloaded 53108 bytes from http://www.google.com
Downloaded 244143 bytes from http://stackoverflow.com
Downloaded 468922 bytes from http://www.amazon.com
Downloaded 27771 bytes from http://www.asp.net
Done.

Dennis

What can I change to achieve what I would like (i.e. all tasks will begin at the same time)?

To synchronize the beginning of the downloads, you could use the Barrier class.

  public void Run()
  {
      List<string> urls = File.ReadAllLines(@"c:/temp/Input/input.txt").ToList();

      // the post-phase action runs once, when every participant has reached the barrier
      Barrier barrier = new Barrier(urls.Count, b => { Console.WriteLine("Start now"); });

      Task[] tasks = new Task[urls.Count];

      Parallel.For(0, urls.Count, (int index) =>
      {
          string path = string.Format("{0}image-{1}.jpg", @"c:/temp/Output/", index + 1);
          tasks[index] = DownloadAsync(new Uri(urls[index]), path, barrier);
      });

      Task.WaitAll(tasks); // wait for completion
      Console.WriteLine("Done");
  }

  async Task DownloadAsync(Uri url, string path, Barrier barrier)
  {
      using (WebClient wc = new WebClient())
      {
          barrier.SignalAndWait(); // blocks this thread until all participants arrive
          await wc.DownloadFileTaskAsync(url, path);
          Output(path);
      }
  }
alexm
  • One shouldn't use `Parallel` with TAP. E.g., look here: http://stackoverflow.com/a/23139769/580053. Your use of `Barrier` blocks threads and must be avoided, since it violates the principles on which TAP is built. – Dennis Sep 04 '15 at 06:27
  • @Dennis: Yes, it is not scalable and is less efficient than using TPL.DataFlow, yet it does not lead to deadlocks for a small number of items. So why did you downvote? – alexm Sep 04 '15 at 06:31
  • @Dennis: the link you provided talks about async void, which is a different matter entirely. – alexm Sep 04 '15 at 06:34
  • Because you're using the wrong tool to achieve what the OP wants, and using it incorrectly. I thought my previous comment explained this clearly, didn't it? – Dennis Sep 04 '15 at 06:34