
I have a list of 10,000,000 URLs in a text file. I open each of them in my await/async method. At the beginning the speed is very good (near 10,000 URLs/min), but while the program is running it decreases, reaching 500 URLs/min after ~10 hours. When I restart the program and run it from the beginning, the situation is the same: fast at first, then slower and slower. I'm working on Windows Server 2008 R2 and have tested my code on various PCs with the same results. Can you tell me where the problem is?

 int finishedUrls = 0;
 IEnumerable<string> urls = File.ReadLines("urlslist.txt");
 await urls.ForEachAsync(500, async url =>
    {
        // Skip anything that isn't an absolute URL.
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri)) return;

        var timeout = new CancellationTokenSource(TimeSpan.FromSeconds(30));
        string html = "";
        using (var httpClient = new HttpClient { Timeout = TimeSpan.FromSeconds(30), MaxResponseContentBufferSize = 300000 })
        using (var request = new HttpRequestMessage(HttpMethod.Get, uri))
        using (var response = await httpClient.SendAsync(request, HttpCompletionOption.ResponseContentRead, timeout.Token).ConfigureAwait(false))
        {
            if (response.StatusCode == HttpStatusCode.OK || response.StatusCode == HttpStatusCode.NotFound)
            {
                // Dispose the response if the timeout fires mid-read.
                using (timeout.Token.Register(response.Dispose))
                {
                    var rawResponse = await response.Content.ReadAsByteArrayAsync().ConfigureAwait(false);
                    html = Encoding.UTF8.GetString(rawResponse);
                }
            }
        }
        Interlocked.Increment(ref finishedUrls);
    });

http://blogs.msdn.com/b/pfxteam/archive/2012/03/05/10278165.aspx
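
(ForEachAsync is the throttling extension from the post linked above; its partitioner-based implementation is roughly:)

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;

    public static class EnumerableExtensions
    {
        // Splits the source into `dop` partitions and runs `body` on each
        // partition concurrently, so at most `dop` operations are in flight.
        public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body)
        {
            return Task.WhenAll(
                from partition in Partitioner.Create(source).GetPartitions(dop)
                select Task.Run(async delegate
                {
                    using (partition)
                        while (partition.MoveNext())
                            await body(partition.Current);
                }));
        }
    }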

Justin Muller
  • Are these URLs all for the same host? – Jon Skeet Jan 23 '14 at 16:59
  • Are you sure your network can sustain 10,000 requests/min? Not sure how big the responses are, but you may be running into a limitation of the network (or some other resource). – Alexei Levenkov Jan 23 '14 at 17:01
  • Possibly related: http://stackoverflow.com/questions/10403944/does-httpwebrequests-limit-of-2-connections-per-host-apply-to-httpclient – davisoa Jan 23 '14 at 17:01
  • @JonSkeet No, they're unique – user3228759 Jan 23 '14 at 17:01
  • @AlexeiLevenkov - yes it can - I test every URL for whether it contains some string, and after 1 min I have 10,000 URLs finished - the network is 200 Mb and the program uses 50% of it - after some time it uses 2%. As I said, if I restart my program the speed is good and network use is about 100 Mb – user3228759 Jan 23 '14 at 17:04
  • 1
    @user3228759: At 10k/min, you could easily be exhausting your ephemeral ports. After use, each TCP/IP port has to "rest" for a bit before it can be used again. – Stephen Cleary Jan 23 '14 at 18:28
  • @StephenCleary Is there a possibility to reuse these ports faster? As I said, if I close the program and run it again, everything works fine. So if I divide the URL list into smaller lists and run each in a separate process, will the problem be solved? – user3228759 Jan 23 '14 at 18:49
  • 1
    No; I suspect it's just the time between closing and restarting the app that would free up those ports. It's system-wide, not a per-process thing. – Stephen Cleary Jan 23 '14 at 18:54
  • Closing and restarting takes 10 sec... so I think if the ports were free after 10 sec, there wouldn't be a problem with the speed decreasing to 500 URLs/min – user3228759 Jan 23 '14 at 19:00
  • It also happens when I stop the method with a cancellation token and start it again - the speed is high at the beginning – user3228759 Jan 24 '14 at 11:13
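
One quick way to test the ephemeral-port theory from the comments above is to count sockets sitting in TIME_WAIT while the slowdown is happening. A minimal sketch using the framework's TCP table (the class name is illustrative):

    using System;
    using System.Linq;
    using System.Net.NetworkInformation;

    // Counts local sockets in TIME_WAIT. Tens of thousands of them while the
    // crawler is slow would point to ephemeral-port exhaustion.
    class TimeWaitCount
    {
        static void Main()
        {
            var connections = IPGlobalProperties.GetIPGlobalProperties().GetActiveTcpConnections();
            Console.WriteLine("TIME_WAIT sockets: {0}",
                connections.Count(c => c.State == TcpState.TimeWait));
        }
    }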

1 Answer


I believe you are exhausting your I/O completion ports. You need to throttle your requests. If you need higher concurrency than a single box can handle, distribute your concurrent requests across more machines. I'd suggest using the TPL for managing the concurrency; I ran into this exact behavior doing similar things. Also, you should absolutely not be disposing your HttpClient per request. Pull that code out and use a single client, as sketched below.
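
A minimal sketch of that advice against the question's code, assuming the same ForEachAsync extension: hoist the HttpClient out of the loop so connections are pooled and reused instead of being torn down per request. The method name and the lower degree of parallelism are illustrative.

    // One client for the whole run; HttpClient is thread-safe for concurrent requests.
    static readonly HttpClient Client = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(30),
        MaxResponseContentBufferSize = 300000
    };

    static async Task CrawlAsync()
    {
        int finishedUrls = 0;
        IEnumerable<string> urls = File.ReadLines("urlslist.txt");
        await urls.ForEachAsync(50, async url =>   // 50 is an illustrative, more modest throttle
        {
            Uri uri;
            if (!Uri.TryCreate(url, UriKind.Absolute, out uri)) return;
            try
            {
                using (var response = await Client.GetAsync(uri).ConfigureAwait(false))
                {
                    if (response.StatusCode == HttpStatusCode.OK || response.StatusCode == HttpStatusCode.NotFound)
                    {
                        var raw = await response.Content.ReadAsByteArrayAsync().ConfigureAwait(false);
                        var html = Encoding.UTF8.GetString(raw);
                        // ... scan html here ...
                    }
                }
            }
            catch (TaskCanceledException)
            {
                // HttpClient.Timeout surfaces as cancellation; skip the slow URL.
            }
            Interlocked.Increment(ref finishedUrls);
        });
    }

On .NET Framework you may also need to raise ServicePointManager.DefaultConnectionLimit, which otherwise caps concurrent connections per endpoint at 2 for non-web apps (the limit davisoa's linked question discusses).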

David Peden
  • I have some generic code that I will try to post up later. It's a refined solution to the original thing I worked on mentioned in my answer above. – David Peden Jan 23 '14 at 17:18