3

Given an application that requests 100 URLs at a time in parallel, out of 10,000 URLs in total, I receive the following error for 50-5000 of them:

The remote name cannot be resolved 'www.url.com'

I understand that the error means the DNS server was unable to resolve the URL. However, for each run, the number of URLs that cannot be resolved changes (ranging from 50 to 5000).

Am I making too many requests too fast? And can I even do that? Running the same test on a much more powerful server shows that only 10 URLs could not be resolved, which sounds much more realistic.

The code that does the parallel requesting:

var semp = new SemaphoreSlim(100);
var uris = File.ReadAllLines(@"C:\urls.txt").Select(x => new Uri(x));

foreach(var uri in uris)
{
   Task.Run(async () =>
   {
      await semp.WaitAsync();
      var result = await Web.TryGetPage(uri); // Using HttpWebRequest
      semp.Release();
   });   
}
ebb
  • 9,297
  • 18
  • 72
  • 123
  • Can you show the most basic code that does this work in parallel? – Haney Jun 22 '14 at 19:43
  • It seems like you're likely just overloading the DNS server with concurrent requests, or similarly your `Web` instance/class is static (or has static members) and all tasks are sharing the connection which would have weird results. – Haney Jun 22 '14 at 19:49
  • Also when you say running it on a much more powerful web server is more effective, I'd bet it uses a different DNS than your machine. – Haney Jun 22 '14 at 19:50
  • @DavidHaney, `Web.TryGetPage` spins up a new `HttpWebRequest` each time - and therefore they should not interfere with each other (I guess). - Both servers are using `Google DNS (8.8.8.8)` - and have the same internet connection speed. The only difference is CPU and Memory. – ebb Jun 22 '14 at 19:52
  • 1
    [Seems like Google's rate limiting your requests](https://developers.google.com/speed/public-dns/docs/security#rate_limit) – Haney Jun 22 '14 at 20:01
  • @DavidHaney, Ah - I see. But I'm not sure that explains why the more powerful server can keep an average of 10 urls that cannot be resolved - while the other server has an average of 1500 urls. – ebb Jun 22 '14 at 20:05
  • 8.8.8.8 can sustain really high request rates... Way beyond what's described here for years at a time. – spender Jun 22 '14 at 20:16

2 Answers

5

I'll bet that you didn't know that the DNS lookup of HttpWebRequest (which is the cornerstone of all .net http apis) happens synchronously, even when making async requests (annoying, right?). This means that firing off many requests at once causes severe ThreadPool strain and a large amount of latency, which can lead to unexpected timeouts. If you really want to step things up, don't use the .net dns implementation: use a third-party library to resolve the hosts, create your web request with an IP instead of a hostname, then manually set the Host header before firing off the request. You can achieve much higher throughput this way.
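For what it's worth, here's a rough sketch of that approach. `Dns.GetHostAddressesAsync` is only a stand-in for whichever third-party resolver you pick (as noted in the comments, the built-in lookup still blocks a ThreadPool thread), and `GetPageByIpAsync` is just an illustrative name:

using System;
using System.IO;
using System.Net;
using System.Threading.Tasks;

public static class IpRequestSketch
{
    // Resolve the host ourselves, then point HttpWebRequest at the IP and carry the
    // real hostname in the Host header so name-based virtual hosting still works.
    public static async Task<string> GetPageByIpAsync(Uri uri)
    {
        // Stand-in resolver; swap in a third-party async DNS library here.
        var addresses = await Dns.GetHostAddressesAsync(uri.Host);
        var ip = addresses[0]; // naive: just takes the first address returned

        // Rebuild the URI around the IP so HttpWebRequest never does its own lookup.
        var builder = new UriBuilder(uri) { Host = ip.ToString() };

        var request = (HttpWebRequest)WebRequest.Create(builder.Uri);
        request.Host = uri.Host; // original hostname goes out in the Host header

        // Note: for https URIs, certificate validation may need extra care when
        // requesting by IP.
        using (var response = (HttpWebResponse)await request.GetResponseAsync())
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream))
        {
            return await reader.ReadToEndAsync();
        }
    }
}

The point is that the request URI carries the IP, so HttpWebRequest never performs its own (synchronous) lookup, while the Host header still carries the real hostname.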

spender
  • 117,338
  • 33
  • 229
  • 351
  • Answering from a phone, so a code example is off the cards atm. Let me know if this doesn't make sense and I'll add code when I'm in front of a real computer. – spender Jun 22 '14 at 20:13
  • Makes perfect sense. This could also explain why I had to "invent" a mechanism that adjusted the timeout of my web requests every 10sec, in order for them not to time out. Would `Dns.GetHostEntryAsync` not be sufficient for resolving the hosts, or is this synchronous under the hood too? :-) – ebb Jun 22 '14 at 20:15
  • 1
    Also synchronous. It's so bad that this remains an issue years after I first noticed it. – spender Jun 22 '14 at 20:19
  • @ebb If you set the minthreads of the ThreadPool as described in the linked question above and this eases symptoms, then likely this is your problem... Otherwise you're overloading something in the network. – spender Jun 22 '14 at 20:28
  • I've tried setting `ThreadPool.SetMinThreads(250, 250)` with no effect. – ebb Jun 22 '14 at 20:34
3

It does sound like you're swamping your local DNS server (in the jargon, your local recursive DNS resolver).

When your program issues a DNS resolution request, it sends a port 53 datagram to the local resolver. That resolver responds either by replying from its cache or recursively resending the request to some other resolver that's been identified as possibly having the record you're looking for.

So, your multithreaded program is causing a lot of datagrams to fly around. Internet Protocol hosts and routers handle congestion and overload by dropping datagram packets. It's like handling a traffic jam on a bridge by bulldozing cars off the bridge. In an overload situation, some packets just disappear.

So, it's up to endpoint software using datagram protocols to try again if their packets get lost. That's the purpose of TCP, and that's how it can provide the illusion of an error-free stream of data even though it can only communicate with datagrams.

So, your program will need to try again when it gets a resolution failure on some of its DNS requests. You're a datagram endpoint, so you own the responsibility of retrying. I suspect the .net library is giving you back failures when some of your requests time out because your datagrams got dropped.
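For example, a minimal retry wrapper might look like the sketch below. `ResolveWithRetryAsync` and `maxAttempts` are illustrative names, and `Dns.GetHostAddressesAsync` stands in for whatever lookup your code actually performs:

using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class DnsRetrySketch
{
    // Retry a single lookup a few times; a dropped datagram will often succeed on a later try.
    public static async Task<IPAddress[]> ResolveWithRetryAsync(string host, int maxAttempts = 3)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                return await Dns.GetHostAddressesAsync(host);
            }
            catch (SocketException)
            {
                if (attempt == maxAttempts) throw; // give up after the last attempt
            }

            // Pause before retrying; hammering the resolver right away only adds to the congestion.
            await Task.Delay(TimeSpan.FromSeconds(attempt));
        }

        throw new InvalidOperationException("unreachable");
    }
}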

Now, here's the important thing. It is also the responsibility of a datagram endpoint program, like yours, to implement congestion control. TCP does this automatically using its sliding window system, with an algorithm called slow-start / exponential backoff. If TCP didn't do this, all internet routers would be congested all the time. This algorithm was dreamed up by Van Jacobson, and you should go read about it.

In the meantime you should implement a simple form of it in your bulk DNS lookup program. Here's how you might do that; a rough code sketch follows the list.

  1. Start with a batch size of, say, 5 lookups.
  2. Every time you get the whole batch back successfully, increase your batch size by one for your next batch. This is slow-start. As long as you're not getting congestion, you increase the network load.
  3. Every time you get a failure to resolve a name, reduce the size of the next batch by half. So, for example, if your batch size was 30 and you got a failure, your next batch size will be 15. This is exponential backoff. You respond to congestion by dramatically reducing the load you're putting on the network.
  4. Implement a maximum batch size of something like 100 just to avoid being too much of a pig and looking like a crude denial-of-service attack to the DNS system.
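Here's a rough sketch of those four steps. The names (`ResolveAllAsync`, `batchSize`, `maxBatchSize`) are illustrative, and `Dns.GetHostAddressesAsync` again stands in for your actual lookup:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

public static class SlowStartSketch
{
    public static async Task ResolveAllAsync(IEnumerable<string> hostNames)
    {
        var hosts = hostNames.Distinct().ToList();

        int batchSize = 5;              // step 1: start small (slow-start)
        const int maxBatchSize = 100;   // step 4: hard cap on the load

        for (int i = 0; i < hosts.Count; )
        {
            var batch = hosts.Skip(i).Take(batchSize).ToList();
            var lookups = batch.Select(h => Dns.GetHostAddressesAsync(h)).ToArray();

            try
            {
                await Task.WhenAll(lookups);

                // step 2: the whole batch came back, so grow the next batch by one
                i += batch.Count;
                batchSize = Math.Min(batchSize + 1, maxBatchSize);
            }
            catch (SocketException)
            {
                // step 3: at least one name failed to resolve, so halve the next batch
                batchSize = Math.Max(batchSize / 2, 1);

                // re-queue only the hosts whose lookups faulted
                // (a real implementation would cap the retries per host)
                var failed = batch.Where((h, idx) => lookups[idx].IsFaulted).ToList();
                i += batch.Count;
                hosts.AddRange(failed);
            }
        }
    }
}

You'd feed it the hostnames pulled out of your URL list, e.g. `File.ReadAllLines(@"C:\urls.txt").Select(x => new Uri(x).Host)`.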

I had a similar project a while ago and this strategy worked well for me.

O. Jones
  • 103,626
  • 17
  • 118
  • 172
  • Thanks for your answer :-) - Is there any way I can make sure that it is in fact the local DNS server that is causing the problem? – ebb Jun 22 '14 at 21:16
  • Also, why would a server with more CPU and Memory capacity be able to do more lookups at a time, and thereby prevent an overload situation? – ebb Jun 22 '14 at 21:50