
I have a function that is meant to check if a specific HTTP(S) URL is a redirect and if so return the new location (but not recursively). It uses the requests library. It looks like this:

    try:
        response = http_session.head(sent_url, timeout=(1, 1))
        if response.is_redirect:
            return response.headers["location"]
        return sent_url
    except requests.exceptions.Timeout:
        return sent_url

Here, the URL I am checking is sent_url. For reference, this is how I create the session:

    import requests

    http_session = requests.Session()
    http_adapter = requests.adapters.HTTPAdapter(max_retries=0)
    http_session.mount("http://", http_adapter)
    http_session.mount("https://", http_adapter)

However, one of the requirements of this program is that it must work for dead links. Based on this, I set a connection timeout (and a read timeout for good measure). After playing around with the values, it still takes about 5-10 seconds for the request to fail with this stack trace, no matter what value I choose. (Maybe relevant: in the browser, the same URL gives DNS_PROBE_POSSIBLE.)

Now, my problem is: 5-10 seconds is too long to wait to find out that a link is dead. There are many links that this program needs to check, and I do not want a few dead links to become such a large bottleneck, hence I want to configure this DNS lookup timeout.

I found this post, which seems relevant (the OP wants to increase the timeout, I want to decrease it), but its solution does not seem applicable: I do not know the IP addresses that these URLs point to. In addition, this feature request from years ago seems relevant, but it did not help me further.

So far, the best solution seems to be to spin up a coroutine for each link (or batch of links) and absorb the timeout asynchronously.

I am on Windows 10, however this code will be deployed on an Ubuntu server. Both use Python 3.8.

So, how can I best give my HTTP requests a very low DNS resolution timeout in the case that it is being fed a dead link?

Paul Hübner
  • It does not seem like the correct approach. A DNS timeout is not the same as a dead link. A DNS response, basically, will not guarantee you a live or dead link (so why bother with it?). You might want to change it to a request timeout (basically, how long you are willing to wait for a response from the server before you consider it 'dead'). The issue with this is that if you set it too low, you will get false positives for links that are alive but slow to respond (think large image files, for example). Something like `response = requests.get(url, timeout=5) ## Waiting for 5 seconds` – blurfus Nov 06 '21 at 01:27
  • @blurfus In any case, I am struggling to implement any sort of timeout. Setting both the connection and read timeout in the timeout tuple does not change anything when the link is "dead". I am aware of the possibility of false positives; they do not pose a huge problem and I can compromise on them. In any case, I am using HEAD requests, so why would the response time be impacted proportionally worse by large images? – Paul Hübner Nov 06 '21 at 01:31
  • I was more thinking in general (for requests that legitimately take longer than whatever timeout you define). For HEAD requests, the server still has to do all the processing to craft a response; the difference is that only the headers are returned (and not the entire resource). – blurfus Nov 06 '21 at 01:38
  • I noticed you are adding the timeout tuple to the session and not the request itself (I am not good at Python, but that seemed odd to me). Is that intended? Did you try this: `response = requests.get(url, timeout=5)`? – blurfus Nov 06 '21 at 01:43
  • Yes, that's intended, @blurfus. I tried both of those; the tuple notation is just there to separate the connect timeout and the read timeout (as specified in the timeout documentation I linked). Neither of them worked, since the DNS resolution is the problem. – Paul Hübner Nov 06 '21 at 14:21

1 Answer


So, how can I best give my HTTP requests a very low DNS resolution timeout in the case that it is being fed a dead link?

Separate things.

Use urllib.parse to extract the hostname from the URL, and then use dnspython to resolve that name, with whatever timeout you want.

Then, and only if the resolution succeeded, fire up requests to grab the HTTP data.
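
A minimal sketch of that two-step approach, assuming dnspython is installed (`pip install dnspython`); the names `resolves_quickly` and `check_redirect` are illustrative, not part of your existing code:

    from urllib.parse import urlsplit

    import dns.exception
    import dns.resolver
    import requests

    resolver = dns.resolver.Resolver()
    resolver.timeout = 1   # seconds to wait for each nameserver
    resolver.lifetime = 1  # total seconds allowed for the whole lookup

    def resolves_quickly(url):
        """Return True if the URL's hostname resolves within the timeout."""
        hostname = urlsplit(url).hostname
        if hostname is None:
            return False
        try:
            # A (IPv4) records only; dnspython >= 2.0, use .query() on 1.x
            resolver.resolve(hostname, "A")
            return True
        except dns.exception.DNSException:  # NXDOMAIN, timeout, no answer, ...
            return False

    def check_redirect(http_session, sent_url):
        if not resolves_quickly(sent_url):
            return sent_url  # unresolvable host, treat it as a dead link
        try:
            response = http_session.head(sent_url, timeout=(1, 1))
            if response.is_redirect:
                return response.headers["location"]
            return sent_url
        except requests.exceptions.Timeout:
            return sent_url

With `lifetime` capping the whole lookup, an unresolvable host should fail in roughly one second instead of the 5-10 seconds you are seeing now.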

@blurfus: in requests you can only use the timeout parameter in the HTTP call, you can't attach it to a session. It is not spelled out explicitly in the documentation, but the code is quite clear on that.

There are many links that this program needs to check,

That is in fact a completely separate problem, and it exists even if all the links are fine; it is purely a matter of volume.

The typical solutions fall into two categories:

  • use asynchronous libraries (they exist for both DNS and HTTP), where your calls are not blocking: you get the data later, so you are able to do something else in the meantime (see the asyncio sketch after this list)
  • use multiprocessing or multithreading to parallelize things and have multiple URLs being tested at the same time by separate instances of your code.
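
For the first option, here is a minimal stdlib-only sketch (no aiohttp/aiodns, just asyncio on Python 3.8) of capping how long you wait for the DNS lookup itself; `host_resolves` and `filter_live_hosts` are illustrative names:

    import asyncio
    from urllib.parse import urlsplit

    async def host_resolves(url, timeout=1.0):
        """Return True if the URL's hostname resolves within `timeout` seconds."""
        hostname = urlsplit(url).hostname
        if hostname is None:
            return False
        loop = asyncio.get_running_loop()
        try:
            # getaddrinfo runs in a thread pool; wait_for caps how long we wait for it
            await asyncio.wait_for(loop.getaddrinfo(hostname, 443), timeout)
            return True
        except (asyncio.TimeoutError, OSError):
            return False

    async def filter_live_hosts(urls):
        # Check all hostnames concurrently; each dead one costs at most `timeout`
        results = await asyncio.gather(*(host_resolves(u) for u in urls))
        return [u for u, ok in zip(urls, results) if ok]

Run it with `asyncio.run(filter_live_hosts(urls))` and only feed the surviving URLs to requests (or to an asynchronous HTTP client).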

They are not completely mutually exclusive, and you can find a lot of pros and cons for each: asynchronous code may be more complicated to write and understand later, so multiprocessing/multithreading is often the first step for a "quick win", as sketched below (especially if you do not need to share anything between the processes/threads, otherwise it quickly becomes a problem), yet handling everything asynchronously makes the code scale more nicely with the volume.
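
And a minimal sketch of the multithreading route, reusing the hypothetical `check_redirect` helper from the sketch above, with nothing shared between workers:

    from concurrent.futures import ThreadPoolExecutor

    import requests

    def check_one(url):
        # One session per call keeps things simple; requests.Session is not
        # documented as thread-safe, so sharing a single session across
        # threads is riskier.
        with requests.Session() as session:
            return check_redirect(session, url)

    def check_all(urls, max_workers=20):
        # Each worker blocks on its own URL, so a slow or dead link only
        # stalls its own thread, not the whole run.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return dict(zip(urls, pool.map(check_one, urls)))

Tune `max_workers` to however many simultaneous lookups and HEAD requests your machine and network can comfortably handle.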

Patrick Mevzek