
I understand the implications of using an async lambda with Parallel.ForEach, which is why I'm not using it here. This then forces me to use .Result on each of my Tasks that make HTTP requests. However, running this simple scraper through the performance profiler shows that .Result accounts for ~98% of exclusive elapsed time, which is obviously due to the blocking nature of the call.

My question is: is there any possibility of optimizing this for it to still be async? I'm not sure that will help in this case since it may just take this long to retrieve the HTML/XML.

I'm running a 4-core processor with 8 logical cores (hence MaxDegreeOfParallelism = 8). Right now I'm looking at around 2.5 hours to download and parse ~51,000 HTML/XML pages of simple financial data.

I was leaning towards using XmlReader instead of Linq2XML to speed up the parsing, but it appears the bottleneck is at the .Result call.

And although it should not matter here, the SEC limits scraping to 10 requests/sec.

public class SECScraper
{
    public event EventHandler<ProgressChangedEventArgs> ProgressChangedEvent;

    // Fields referenced below; declarations inferred from usage, and the
    // SEC_HOSTNAME value is an assumption.
    private readonly HttpClient _client;
    private readonly FinanceContext _financeContext;
    private int _numDownloaded;
    private int _interval;
    private const string SEC_HOSTNAME = "https://www.sec.gov";

    public SECScraper(HttpClient client, FinanceContext financeContext)
    {
        _client = client;
        _financeContext = financeContext;
    }

    public void Download()
    {
        _numDownloaded = 0;
        _interval = _financeContext.Companies.Count() / 100;

        Parallel.ForEach(_financeContext.Companies, new ParallelOptions {MaxDegreeOfParallelism = 8},
            company =>
            {
                RetrieveSECData(company.CIK);
            });
    }

    protected virtual void OnProgressChanged(ProgressChangedEventArgs e)
    {
        ProgressChangedEvent?.Invoke(this, e);
    }

    private void RetrieveSECData(int cik)
    {
        // move this url elsewhere
        var url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" + cik +
                  "&type=10-q&dateb=&owner=include&count=100";

        var srBody = ReadHTML(url).Result; // consider moving this to srPage
        var srPage = new SearchResultsPage(srBody);

        var reportLinks = srPage.GetAllReportLinks();

        foreach (var link in reportLinks)
        {
            url = SEC_HOSTNAME + link;

            var fdBody = ReadHTML(url).Result;
            var fdPage = new FilingDetailsPage(fdBody);

            var xbrlLink = fdPage.GetInstanceDocumentLink();

            var xbrlBody = ReadHTML(SEC_HOSTNAME + xbrlLink).Result;
            var xbrlDoc = new XBRLDocument(xbrlBody);
            var epsData = xbrlDoc.GetAllEPSData();

            //foreach (var eps in epsData)
            //    Console.WriteLine($"{eps.StartDate} to {eps.EndDate} -- {eps.EPS}");
        }

        IncrementNumDownloadedAndNotify();
    }

    // Hypothetical sketch of the missing helper, assuming progress is
    // reported once per _interval companies completed:
    private void IncrementNumDownloadedAndNotify()
    {
        int n = Interlocked.Increment(ref _numDownloaded);
        if (_interval > 0 && n % _interval == 0)
            OnProgressChanged(new ProgressChangedEventArgs(n / _interval, null));
    }

    private async Task<string> ReadHTML(string url)
    {
        using var response = await _client.GetAsync(url);
        return await response.Content.ReadAsStringAsync();
    }
}
keelerjr12
  • If the remote site rate-limits you to 10 requests/sec, then the minimum theoretical time to retrieve 51k pages is 51,000 ÷ 10 = 5,100 seconds, or about 85 minutes, which isn't far off your 150-minute mark. The overhead of that many requests likely makes up the difference, so I don't think there's much you can do other than trying to remove or increase the limit, or downloading more data per request. – Kieran Devlin Jul 02 '19 at 22:26
  • a) There is no need to use `Parallel.ForEach`: https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/async/how-to-make-multiple-web-requests-in-parallel-by-using-async-and-await and https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/async/how-to-extend-the-async-walkthrough-by-using-task-whenall b) You likely want to increase `DefaultConnectionLimit`: https://stackoverflow.com/questions/48785681/the-operation-has-timed-out-at-system-net-httpwebrequest-getresponse-while-sen/48785949#48785949 c) Once b) is done, use `Semaphore` to limit concurrent downloads (see the sketch after these comments). – mjwills Jul 03 '19 at 02:42
  • [Here is an answer](https://stackoverflow.com/a/56862796/11178549) to a similar question about limiting the amount of concurrent async I/O operations, that uses a custom implementation of `Task.WhenAll`. – Theodor Zoulias Jul 03 '19 at 04:27
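
A minimal sketch of the throttling idea from point (c) above, using SemaphoreSlim (the async-friendly counterpart of Semaphore); the field name and the cap of 10 are illustrative, and the body wraps the question's existing ReadHTML logic:

private static readonly SemaphoreSlim _gate = new SemaphoreSlim(10);

private async Task<string> ReadHTMLThrottled(string url)
{
    await _gate.WaitAsync(); // at most 10 requests in flight at once
    try
    {
        using var response = await _client.GetAsync(url);
        return await response.Content.ReadAsStringAsync();
    }
    finally
    {
        _gate.Release(); // always release, even if the request throws
    }
}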

2 Answers


The task is not CPU-bound but network-bound, so there is no need to use multiple threads.

Make multiple async calls on one thread; just don't await them immediately. Put the tasks in a list. When you have a certain number in flight (say you want 10 going at once), start waiting for the first one to finish (look up Task.WhenAny for more info).

Then put more on :-) You can then control the size of the list of tasks, and the rate in requests/second, using other code.
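
A minimal sketch of that rolling-window pattern, assuming a hypothetical DownloadAsync(url) helper, a urls sequence, and a cap of 10 in-flight requests:

var inFlight = new List<Task>();
foreach (var url in urls)
{
    inFlight.Add(DownloadAsync(url));
    if (inFlight.Count >= 10)
    {
        // Wait for any one task to complete, then make room for the next.
        var finished = await Task.WhenAny(inFlight);
        inFlight.Remove(finished);
    }
}
await Task.WhenAll(inFlight); // drain whatever is still running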

FastAl
  • @keelerjr12 just to clarify why you can run it on the same thread async: `HttpClient` has its own connection pool, so it can run multiple connections at once. – Kieran Devlin Jul 02 '19 at 22:32
  • @FastAl, I understand that it is IO-bound. However, do multiple threads not help in this case, where I have to parse the HTML/XML using HTML Agility Pack and Linq2XML? With 8 logical cores, it seems I could leverage those to parse different documents as they're returned from the HTTP requests, making this somewhat CPU-bound as well. – keelerjr12 Jul 03 '19 at 11:32
  • @keelerjr12, yes, that could slow things down. Check out the generic BlockingCollection; you could use that, or parallel queuing: download everything with the multiple asyncs above on one thread and put the pages in a queue for other threads to process (see the sketch below). BUT PRACTICALLY, chances are the downloading still takes longer and you'd get no gain, because the download is the only real work your thread is doing. An easy way to find out is this: see if the one CPU you're using is pegged in monitoring. (Might have to set up affinity?) – FastAl Jul 08 '19 at 14:37
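
A rough sketch of the producer/consumer split FastAl describes, assuming hypothetical urls and ParsePage members alongside the question's ReadHTML; one async loop downloads while worker threads parse:

var pages = new BlockingCollection<string>(boundedCapacity: 100);

// CPU-bound consumers: one parser task per core.
var parsers = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(_ => Task.Run(() =>
    {
        foreach (var body in pages.GetConsumingEnumerable())
            ParsePage(body); // hypothetical CPU-bound parsing
    }))
    .ToArray();

// I/O-bound producer: a single async download loop.
foreach (var url in urls)
    pages.Add(await ReadHTML(url));

pages.CompleteAdding(); // signal the parsers that no more pages are coming
await Task.WhenAll(parsers);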

is there any possibility of optimizing this for it to still be async?

Yes. I'm not sure why you're using Parallel in the first place; it seems like the wrong solution for this kind of problem. You have asynchronous work to do across a collection of items, so a better fit would be asynchronous concurrency; this is done using Task.WhenAll:

public class SECScraper
{
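  // _client, _financeContext, SEC_HOSTNAME, and the other members are
  // unchanged from the question's code.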
  public async Task DownloadAsync()
  {
    _numDownloaded = 0;
    _interval = _financeContext.Companies.Count() / 100;

    var tasks = _financeContext.Companies.Select(company => RetrieveSECDataAsync(company.CIK)).ToList();
    await Task.WhenAll(tasks);
  }

  private async Task RetrieveSECDataAsync(int cik)
  {
    var url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" + cik +
        "&type=10-q&dateb=&owner=include&count=100";

    var srBody = await ReadHTMLAsync(url);
    var srPage = new SearchResultsPage(srBody);

    var reportLinks = srPage.GetAllReportLinks();

    foreach (var link in reportLinks)
    {
      url = SEC_HOSTNAME + link;

      var fdBody = await ReadHTMLAsync(url);
      var fdPage = new FilingDetailsPage(fdBody);

      var xbrlLink = fdPage.GetInstanceDocumentLink();

      var xbrlBody = await ReadHTMLAsync(SEC_HOSTNAME + xbrlLink);
      var xbrlDoc = new XBRLDocument(xbrlBody);
      var epsData = xbrlDoc.GetAllEPSData();
    }

    IncrementNumDownloadedAndNotify();
  }

  private async Task<string> ReadHTMLAsync(string url)
  {
    using var response = await _client.GetAsync(url);
    return await response.Content.ReadAsStringAsync();
  }
}

Also, I recommend you use IProgress<T> for reporting progress.
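
A minimal sketch of that suggestion (not from the original answer), showing only the progress-related members and assuming the constructor is extended to take an IProgress<int>:

private readonly IProgress<int> _progress; // injected via the constructor

private void IncrementNumDownloadedAndNotify() =>
    _progress?.Report(Interlocked.Increment(ref _numDownloaded));

// Progress<T> runs its callback on the SynchronizationContext captured at
// construction (e.g., a UI thread):
// var scraper = new SECScraper(client, financeContext,
//     new Progress<int>(n => Console.WriteLine($"{n} companies downloaded")));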

Stephen Cleary
  • I actually used a similar solution to what you're proposing initially, but did not rate-limit the HTTP requests, so I was banned. I will use the solution you proposed and then decorate the HttpClient to limit my requests to 10 requests/sec (see the sketch after these comments).
  • In my Getxxx() methods I am actually parsing the HTML and XML; do multiple threads not help in this case? To me it seems it is also CPU-bound. – keelerjr12 Jul 03 '19 at 12:04
  • @keelerjr12: CPU-bound, yes, but is it CPU-intensive? In my experience, parsing html is extremely fast. The only way to know whether parallelism would help there is to measure it both ways. If you do want to explore adding parallelism, I would recommend either TPL Dataflow or using Channels with Parallel; that way, your asynchronous work is separate from your CPU-intensive work. – Stephen Cleary Jul 03 '19 at 12:13
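
A hypothetical sketch of the HttpClient decoration keelerjr12 mentions above: a DelegatingHandler that spaces request starts at least 100 ms apart (roughly 10 requests/sec). All names are illustrative:

public class RateLimitingHandler : DelegatingHandler
{
    private static readonly TimeSpan MinDelay = TimeSpan.FromMilliseconds(100);
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private DateTime _lastSent = DateTime.MinValue;

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Serialize the bookkeeping so concurrent callers space out correctly.
        await _gate.WaitAsync(cancellationToken);
        try
        {
            var wait = _lastSent + MinDelay - DateTime.UtcNow;
            if (wait > TimeSpan.Zero)
                await Task.Delay(wait, cancellationToken);
            _lastSent = DateTime.UtcNow;
        }
        finally
        {
            _gate.Release();
        }
        return await base.SendAsync(request, cancellationToken);
    }
}

// Usage:
// var client = new HttpClient(
//     new RateLimitingHandler { InnerHandler = new HttpClientHandler() });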