
I'm writing a C# console application that scrapes data from web pages.

This application will go to about 8000 web pages and scrape data (the same format of data on each page).

I have it working right now with no async methods and no multithreading.

However, I need it to be faster. It only uses about 3%-6% of the CPU, I think because it spends its time waiting to download the HTML (WebClient.DownloadString(url)).

This is the basic flow of my program

DataSet alldata;

foreach(var url in the8000urls)
{
    // ScrapeData downloads the html from the url with WebClient.DownloadString
    // and scrapes the data into several datatables which it returns as a dataset.
    DataSet dataForOnePage = ScrapeData(url);

    //merge each table in dataForOnePage into allData
}

// PushAllDataToSql(alldata);

I've been trying to multithread this but am not sure how to get started properly. I'm using .NET 4.5, and my understanding is that async and await in 4.5 are made to make this much easier to program, but I'm still a little lost.

My idea was to just keep making new threads that are async for this line

DataSet dataForOnePage = ScrapeData(url);

and then as each one finishes, run

//merge each table in dataForOnePage into allData

Can anyone point me in the right direction on how to make that line async in .NET 4.5 C#, and then have my merge method run on completion?

Thank you.

Edit: Here is my ScrapeData method:

public static DataSet GetPropertyData(CookieAwareWebClient webClient, string pageid)
{
    var dsPageData = new DataSet();

    // DOWNLOAD HTML FOR THE REO PAGE AND LOAD IT INTO AN HTMLDOCUMENT
    string url = @"https://domain.com?&id=" + pageid + @"restofurl";
    string html = webClient.DownloadString(url);
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // A BUNCH OF PARSING WITH HTMLAGILITY AND STORING IN dsPageData 
    return dsPageData;
}
Noctis
Kyle
  • http://msdn.microsoft.com/en-us/library/hh556530(v=vs.110).aspx – dugas Jul 24 '12 at 20:36
  • Take a look at PLinq: `the8000urls.AsParallel().ForAll(...)`. http://msdn.microsoft.com/en- – asawyer Jul 24 '12 at 20:37
  • @asawyer `AsParallel` will work, but it will be somewhat wasteful, in that it spawns threads to wait on inherently async operations. Granted, it's easier and can work, but there are more elegant solutions. – casperOne Jul 24 '12 at 21:19

4 Answers


If you want to use the async and await keywords (you don't have to, but they do make things easier in .NET 4.5), you would first want to change your ScrapeData method to return a Task<T> instance using the async keyword, like so:

async Task<DataSet> ScrapeDataAsync(Uri url)
{
    // Create the HttpClientHandler which will handle cookies.
    var handler = new HttpClientHandler();

    // Set cookies on handler.

    // Await on an async call to fetch here, convert to a data
    // set and return.
    var client = new HttpClient(handler);

    // Wait for the HttpResponseMessage.
    HttpResponseMessage response = await client.GetAsync(url);

    // Get the content, await on the string content.
    string content = await response.Content.ReadAsStringAsync();

    // Process content variable here into a data set and return.
    DataSet ds = ...;

    // Return the DataSet, it will return Task<DataSet>.
    return ds;
}

Note that you'll probably want to move away from the WebClient class, as it doesn't support Task<T> inherently in its async operations. A better choice in .NET 4.5 is the HttpClient class. I've chosen to use HttpClient above. Also, take a look at the HttpClientHandler class, specifically the CookieContainer property which you'll use to send cookies with each request.
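For example, the cookie setup might look something like this; the cookie name and value here are placeholders for whatever your login flow actually produces:

```csharp
using System;
using System.Net;
using System.Net.Http;

var cookies = new CookieContainer();
// "sessionid" and its value are hypothetical; substitute your real cookie(s).
cookies.Add(new Uri("https://domain.com"), new Cookie("sessionid", "your-session-value"));

var handler = new HttpClientHandler { CookieContainer = cookies };
var client = new HttpClient(handler);
// Requests made with this client now send the matching cookies automatically.
```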

Inside that method, you will more than likely have to use the await keyword to wait on other async operations, which in this case would be the download of the page. You'll have to tailor your calls that download data to use the asynchronous versions and await those.

Once that is complete, you would normally await each call, but you can't do that here: you are running a loop, so the variable would be reset with each iteration (and you'd be back to downloading one page at a time). In this case, it's better to store each Task<T> in a list, like so:

DataSet alldata = ...;

var tasks = new List<Task<DataSet>>();

foreach(var url in the8000urls)
{
    // ScrapeData downloads the html from the url with 
    // WebClient.DownloadString
    // and scrapes the data into several datatables which 
    // it returns as a dataset.
    tasks.Add(ScrapeDataAsync(url));
}

There is the matter of merging the data into allData. To that end, you want to call the ContinueWith method on the Task<T> instance returned and perform the task of adding the data to allData:

DataSet allData = ...;

var tasks = new List<Task<DataSet>>();

foreach(var url in the8000urls)
{
    // ScrapeData downloads the html from the url with 
    // WebClient.DownloadString
    // and scrapes the data into several datatables which 
    // it returns as a dataset.
    tasks.Add(ScrapeDataAsync(url).ContinueWith(t => {
        // Lock access to the data set, since this is
        // async now.
        lock (allData)
        {
             // Add the data (t.Result) to allData.
        }
    }));
}

Then, you can wait on all the tasks using the WhenAll method on the Task class and await on that:

// After your loop.
await Task.WhenAll(tasks);

// Process allData

However, note that you have a foreach, and WhenAll takes an IEnumerable<T> implementation. This is a good indicator that LINQ is suitable here, which it is:

DataSet allData;

var tasks = 
    from url in the8000Urls
    select ScrapeDataAsync(url).ContinueWith(t => {
        // Lock access to the data set, since this is
        // async now.
        lock (allData)
        {
             // Add the data.
        }
    });

await Task.WhenAll(tasks);

// Process allData

You can also choose not to use query syntax if you wish; it doesn't matter in this case.
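For reference, a method-syntax version might look like this (same placeholder merge as above; ScrapeDataAsync and allData come from the earlier snippets):

```csharp
using System.Linq;
using System.Threading.Tasks;

var tasks = the8000Urls.Select(url =>
    ScrapeDataAsync(url).ContinueWith(t =>
    {
        // Lock access to the data set, since this is async now.
        lock (allData)
        {
            // Add the data from t.Result.
        }
    }));

await Task.WhenAll(tasks);
```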

Note that if the containing method is not marked as async (because you are in a console application and have to wait for the results before the app terminates) then you can simply call the Wait method on the Task returned when you call WhenAll:

// This will block, waiting for all tasks to complete, all
// tasks will run asynchronously and when all are done, then the
// code will continue to execute.
Task.WhenAll(tasks).Wait();

// Process allData.

Namely, the point is, you want to collect your Task instances into a sequence and then wait on the entire sequence before you process allData.

However, I'd suggest trying to process the data before merging it into allData if you can; unless the processing requires the entire DataSet, you'll get even more performance gains by processing each page's data as soon as it comes back, as opposed to waiting for all of it to come back.

casperOne
  • Was typing up a nice long answer and then you went and posted this one :) Nice post, upvoted. – Graymatter Jul 24 '12 at 21:27
  • Thanks for the help. This helps me a lot with half of my problem (waiting for them all to finish, then merging), but I'm still confused on how to change my ScrapeData method, because I'm not sure where or how to use await. I was downloading the html with webclient.DownloadString, which returns a string. There is an async method called webclient.DownloadStringAsync, which returns void, and the compiler tells me that I cannot use await on void. – Kyle Jul 24 '12 at 21:27
  • @casperOne Thanks for that example. I just posted what I was using before. I will look into HttpClient instead of WebClient, maybe that is the way to go for this.. – Kyle Jul 24 '12 at 21:35
  • I tried writing it with HttpWebRequest instead of HttpClient, because I couldn't find a way to use a cookie with HttpClient and I have to be logged in. I tried running the program, and I can get it to break on `await Task.WhenAll(tasks);` but it exits the program after that instead of processing the lines after. – Kyle Jul 24 '12 at 22:08
  • That's because [it's a console program](http://nitoprograms.blogspot.com/2012/02/async-console-programs.html). Try using [`AsyncContext.RunTask`](http://nitoasyncex.codeplex.com/wikipage?title=AsyncContext). – Stephen Cleary Jul 25 '12 at 00:53
  • @user1308743 Updated the answer to include what you should do instead of using `await` when the containing method is not `async` as well as how to use cookies using `HttpClient`. – casperOne Jul 25 '12 at 13:22
  • @StephenCleary That's a little overkill. Why not just call `Wait`? I mean, that's really all that needs to be accomplished here. – casperOne Jul 25 '12 at 13:23
  • `Wait` is acceptable, as long as you understand it changes your exception handling. – Stephen Cleary Jul 25 '12 at 15:04
  • Thanks @casperOne♦. I'm playing around with a crawler at the moment and will use this in my implementation. – Babak Naffas Aug 09 '12 at 00:30
  • Keep in mind, when working with IEnumerables in Linq that the query itself will not be executed UNTIL it's enumerated (and that it will be executed again EACH TIME it's enumerated). -- Nothing wrong with this example (I think it's awesome) -- but just keep in mind: there be dragons here. -- I recommend just calling await Task.WhenAll([put inline Linq here]); -- so you never have any "var tasks" object to mess you up. ;-) – BrainSlugs83 May 11 '13 at 23:05
  • Also note: WebClient.DownloadStringAsync uses the OLD async model -- you have to handle the DownloadStringCompleted event to use it -- WebClient however does also support the new model -- use the WebClient.DownloadStringTaskAsync method instead -- you'll notice in the intelli-sense documentation that it not only returns a Task object but also that it's declared to be "(awaitable)". – BrainSlugs83 May 11 '13 at 23:09
  • @BrainSlugs83 FYI `Task` instances are awaitable. Also, `WebClient` probably shouldn't be used in most situations anymore in light of `HttpClient`. – casperOne May 12 '13 at 00:32

You could also use TPL Dataflow, which is a good fit for this kind of problem.

In this case, you build a "dataflow mesh" and then your data flows through it.

This one is actually more like a pipeline than a "mesh". I'm putting in three steps: Download the (string) data from the URL; Parse the (string) data into HTML and then into a DataSet; and Merge the DataSet into the master DataSet.

First, we create the blocks that will go in the mesh:

DataSet allData;
var downloadData = new TransformBlock<string, string>(
  async pageid =>
  {
    // Create a client here (or your CookieAwareWebClient, if you need cookies).
    var webClient = new System.Net.WebClient();
    var url = "https://domain.com?&id=" + pageid + "restofurl";
    return await webClient.DownloadStringTaskAsync(url);
  },
  new ExecutionDataflowBlockOptions
  {
    MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded,
  });
var parseHtml = new TransformBlock<string, DataSet>(
  html =>
  {
    var dsPageData = new DataSet();
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // HTML Agility parsing

    return dsPageData;
  },
  new ExecutionDataflowBlockOptions
  {
    MaxDegreeOfParallelism = DataflowBlockOptions.Unbounded,
  });
var merge = new ActionBlock<DataSet>(
  dataForOnePage =>
  {
    // merge dataForOnePage into allData
  });

Then we link the three blocks together to create the mesh:

downloadData.LinkTo(parseHtml);
parseHtml.LinkTo(merge);

Next, we start pumping data into the mesh:

foreach (var pageid in the8000urls)
  downloadData.Post(pageid);

And finally, we wait for each step in the mesh to complete (this will also cleanly propagate any errors):

downloadData.Complete();
await downloadData.Completion;
parseHtml.Complete();
await parseHtml.Completion;
merge.Complete();
await merge.Completion;

The nice thing about TPL Dataflow is that you can easily control how parallel each part is. For now, I've set both the download and parsing blocks to be Unbounded, but you may want to restrict them. The merge block uses the default maximum parallelism of 1, so no locks are necessary when merging.
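As a sketch, bounding the blocks could look like this; the specific limits are hypothetical and should be tuned to your bandwidth and CPU:

```csharp
using System;
using System.Threading.Tasks.Dataflow;

// Hypothetical limits; pass these in place of the Unbounded options
// when constructing the blocks above.
var downloadOptions = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 10   // at most 10 pages downloading at once
};
var parseOptions = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = Environment.ProcessorCount   // bound parsing by CPU
};
```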

Stephen Cleary
  • If this question was asked today, I would have answered with a TPL-based solution instead of [the one I gave](http://stackoverflow.com/a/11639434/50776); it's definitely easier to wire up everything and a lot cleaner. – casperOne Nov 13 '12 at 18:13
  • Wouldn't TPL be overkill here? Wasn't TPL developed mainly for CPU-bound parallel programs? – infinity Jan 04 '13 at 01:33
  • TPL Dataflow is a `Task`-based asynchronous mesh. It's not actually part of TPL as it exists in .NET, but is an add-on library that was developed by the same team (who also developed the `async` supporting types). – Stephen Cleary Jan 04 '13 at 01:48
  • @iNfinity That would be incorrect. It's actually very close to its name. It doesn't have to be CPU-bound; you can easily have I/O-bound operations be part of a dataflow. It's about breaking down operations into blocks and then linking all of the blocks together, with the ability to control how all of the blocks handle things like parallelism, buffering, etc. It's not overkill at all, IMO; once you get it, the blocks are really easy to put together and you see things in these logical units which fit TPL very well. – casperOne Feb 06 '13 at 12:45
  • Great. Now I have to find out what the heck TPL Dataflow is -- and here I thought I was finally catching up to all the latest stuff! XD – BrainSlugs83 May 11 '13 at 23:12
  • So, how many threads will be running if you were to estimate the average number of pages being downloaded at once? Is it 1, or 2? Or is it determined basically by the CPU speed, in which case can go up to hundreds? (in which case, your internet speed hits a road block much sooner)? So, just to answer my question, how many download instances are occurring at any given point in time with the code above? Doesn't need to be exact, just ball-park figure is ok. I was going to do this a different way, in my next comment, i'll outline splitting this up into 10 or 20 Tasks, where each routine... – Erx_VB.NExT.Coder Sep 24 '13 at 13:03
  • @user1308743 where each routine (the meshes) are all in one class, and you can instantiate this class (say) 20 times, so that each class's foreach loop will skip by 20 (can be 10, or 50). The Skip number (20) and threads set number (20) must be the same. Then, start the task for each thread from a parent/master foreach loop, once started, each thread will do its next bit incrementing by 20 until the 8,000 are finished. You can't access properties in the class directly, but the class can send statistical information to the parent foreach class that's running it, and you can report that. u like? – Erx_VB.NExT.Coder Sep 24 '13 at 13:09
  • @Erx_VB.NExT.Coder: I suggest you ask your own question(s) on SO. – Stephen Cleary Sep 24 '13 at 13:27

I recommend reading my reasonably-complete introduction to async/await.

First, make everything asynchronous, starting at the lower-level stuff:

public static async Task<DataSet> ScrapeDataAsync(string pageid)
{
  CookieAwareWebClient webClient = ...;
  var dsPageData = new DataSet();

  // DOWNLOAD HTML FOR THE REO PAGE AND LOAD IT INTO AN HTMLDOCUMENT
  string url = @"https://domain.com?&id=" + pageid + @"restofurl";
  string html = await webClient.DownloadStringTaskAsync(url).ConfigureAwait(false);
  var doc = new HtmlDocument();
  doc.LoadHtml(html);

  // A BUNCH OF PARSING WITH HTMLAGILITY AND STORING IN dsPageData 
  return dsPageData;
}

Then you can consume it as follows (using async with LINQ):

DataSet alldata;
var tasks = the8000urls.Select(async url =>
{
  var dataForOnePage = await ScrapeDataAsync(url);

  //merge each table in dataForOnePage into allData

});
await Task.WhenAll(tasks);
PushAllDataToSql(alldata);

And use AsyncContext from my AsyncEx library since this is a console app:

class Program
{
  static int Main(string[] args)
  {
    try
    {
      return AsyncContext.Run(() => MainAsync(args));
    }
    catch (Exception ex)
    {
      Console.Error.WriteLine(ex);
      return -1;
    }
  }

  static async Task<int> MainAsync(string[] args)
  {
    ...
  }
}

That's it. No need for locking or continuations or any of that.

Stephen Cleary

I believe you don't need async and await here. They can help in desktop applications, where you need to move your work off the GUI thread. In my opinion, it will be better to use the Parallel.ForEach method in your case. Something like this:

    DataSet alldata;
    var bag = new ConcurrentBag<DataSet>();

    Parallel.ForEach(the8000urls, url =>
    {
        // ScrapeData downloads the html from the url with WebClient.DownloadString 
        // and scrapes the data into several datatables which it returns as a dataset. 
        DataSet dataForOnePage = ScrapeData(url);
        // Add data for one page to temp bag
        bag.Add(dataForOnePage);
    });

    //merge each table in dataForOnePage into allData from bag

    PushAllDataToSql(alldata); 
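The final merge from the bag could be a simple loop; this sketch assumes DataSet.Merge handles your table layout (it matches tables by name and merges their rows):

```csharp
using System.Data;

// alldata and bag come from the snippet above.
foreach (DataSet dataForOnePage in bag)
{
    alldata.Merge(dataForOnePage);
}
```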
Alexander
  • This is brute-forcing it. You can do this, but at the same time you're wasting threads waiting on inherently async operations (`Parallel` will spawn threads to handle the partitions of `the8000urls` and then those threads will block when fetching the urls). You don't *need* `async`/`await` but it definitely is more elegant and makes better use of the resources you have. – casperOne Jul 25 '12 at 21:24
  • That's the idea. It is a console application and it should be faster. With `async/await` you'll still be downloading one url at a time, and that's not acceptable. With `Parallel.ForEach` it is possible to download many more urls at a time, improving overall application performance. And that's exactly what user1308743 needs. – Alexander Jul 26 '12 at 07:17
  • That's not true. With `async`/`await` they are not loaded one at a time, they are started off asynchronously and all are waited for at the end. Your interpretation of what `async`/`await` does is incorrect. – casperOne Jul 26 '12 at 12:06
  • Hm, it looks like I hadn't noticed the `List` in your post while reading it for the first time. Returning tasks and awaiting them in conjunction with `async`/`await` in the method body will definitely be the best choice. – Alexander Jul 26 '13 at 13:56