
I'm writing a web scraping program, and I have a situation where one page contains 10 links; for every link I need to download the HTML text to scrape data from it, then move on to the next page and repeat the whole process. When I did this synchronously, it took 5-10 seconds to download the HTML text for a single link (the pages are slow even when I open them in a browser). So I looked for an asynchronous way to implement this, and with it, downloading the HTML text for all 10 links takes 5-10 seconds. I have to loop through 100 pages, and it took 30 minutes to process all the data.

I don't have much experience with Tasks in C#, so I wrote this code and it works, but I'm not sure whether it's good or whether a better solution exists.

class Program
{
    public static List<Task<string>> TaskList = new List<Task<string>>();
    public static List<Data> webData = new List<Data>();

    public static async Task<string> GetHtmlText(string link)
    {
        using (HttpClient client = new HttpClient())
        {
            return await client.GetStringAsync(link);
        }
    }

    public static void Main(string[] args)
    {
        for (int i = 0; i < 100; i++)
        {
            List<string> links = GetLinksFromPage(i); // returns 10 links from page // replaced with edit solution >>>

            TaskList.Clear(); // clear results from the previous page before starting new downloads

            foreach (var link in links)
            {
                Task<string> task = Task.Run(() => GetHtmlText(link));
                TaskList.Add(task);
            }

            Task.WaitAll(TaskList.ToArray()); // replaced with edit solution <<<

            foreach (Task<string> task in TaskList)
            {
                string html = task.Result;
                Data data = GetDataFromHtml(html);
                webData.Add(data);
            }
            ...
        }
    }
}

EDIT: This made my day: setting ServicePointManager.DefaultConnectionLimit to 50.

ServicePointManager.DefaultConnectionLimit = 50;

var concurrentBag = new ConcurrentBag<string>();

var t = linksFromPage.Select(async link =>
{
    var response = await GetLinkStringTaskAsync(link);
    concurrentBag.Add(response);
});

await Task.WhenAll(t);
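
For context, a minimal sketch of how that edit could fit into the full page loop, with a single shared HttpClient as suggested in the comments below. GetLinksFromPage, GetDataFromHtml, and Data are the question's own placeholders; RunAsync is a hypothetical wrapper, since Main cannot be async here:

// Sketch only: one HttpClient reused for all requests instead of one per call.
private static readonly HttpClient client = new HttpClient();

public static async Task<string> GetLinkStringTaskAsync(string link)
{
    return await client.GetStringAsync(link);
}

public static async Task RunAsync()
{
    ServicePointManager.DefaultConnectionLimit = 50;
    var webData = new List<Data>();

    for (int i = 0; i < 100; i++)
    {
        List<string> linksFromPage = GetLinksFromPage(i);
        var concurrentBag = new ConcurrentBag<string>();

        // No Task.Run wrapper: the work is I/O-bound, so awaiting
        // GetLinkStringTaskAsync directly is enough.
        var t = linksFromPage.Select(async link =>
        {
            var response = await GetLinkStringTaskAsync(link);
            concurrentBag.Add(response);
        });

        await Task.WhenAll(t);

        foreach (string html in concurrentBag)
        {
            webData.Add(GetDataFromHtml(html));
        }
    }
}

// Called from Main with: RunAsync().GetAwaiter().GetResult();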
    you can change 1) the statement `Task task= Task.Run(() => GetHtmlText(link));` to `Task task= GetHtmlText(link);`. You don't need to wrap a task in another task via `Task.Run` 2) `Task.WaitAll(TaskList.ToArray()); ` to `await Task.WhenAll(TaskList.ToArray());` – user1672994 Aug 01 '20 at 17:32
  • @user1672994 ty for advice :) – TJacken Aug 01 '20 at 17:41
  • You want to reuse the HttpClient for all calls, or better yet, use the HttpClientFactory. Also, you can try running pages in [parallel](https://learn.microsoft.com/en-us/dotnet/api/system.threading.tasks.parallel.foreach?view=netcore-3.1). Whatever gain you get will depend on the number of cores in your machine. – insane_developer Aug 01 '20 at 17:43
  • @insane_developer regarding your suggestion of using `Parallel.ForEach`, nope, this method is only suitable for parallelizing CPU-bound workloads, and [it is completely unsuitable](https://stackoverflow.com/questions/15136542/parallel-foreach-with-asynchronous-lambda) for parallelizing asynchronous I/O operations. – Theodor Zoulias Aug 01 '20 at 18:01
  • @insane_developer Sorry, I didn't write it in my post: I am using .NET Framework 4.6 and there is no HttpClientFactory. – TJacken Aug 01 '20 at 18:02
  • @Anve, take a look at [how to properly use HttpClient](https://aspnetmonsters.com/2016/08/2016-08-27-httpclientwrong/). – insane_developer Aug 01 '20 at 18:19
  • @TheodorZoulias, yes, you are right, that is what it is meant for. – insane_developer Aug 01 '20 at 18:21
  • @insane_developer ty for this :) – TJacken Aug 01 '20 at 18:24
  • @IanKemp [Don't use the existence of Code Review as a reason to close a question](//meta.stackoverflow.com/q/287400). From now on evaluate the question and use a reason like; needs focus, primarily opinion-based, etc. – Peilonrayz Aug 01 '20 at 21:56
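
To illustrate Theodor Zoulias's comment above: Parallel.ForEach does compile with an async lambda, but the lambda is treated as async void, so the loop returns as soon as each body reaches its first await. A minimal sketch of the pitfall (links, client, and concurrentBag as in the snippets above):

// BROKEN: Parallel.ForEach only waits for each async void body to reach
// its first await, then considers it "done" and moves on.
Parallel.ForEach(links, async link =>
{
    var html = await client.GetStringAsync(link); // still in flight below
    concurrentBag.Add(html);
});
// Execution reaches this point while the downloads are still running,
// so concurrentBag is likely still empty here. Use Select + Task.WhenAll instead.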
