Given a URL (String ref), I am attempting to retrieve the redirected URL as follows:

        HttpURLConnection con = (HttpURLConnection)new URL(ref).openConnection();
        con.setInstanceFollowRedirects(false);
        con.setRequestProperty("User-Agent","");
        int responseType = con.getResponseCode()/100;
        while (responseType == 1)
        {
            Thread.sleep(10);
            responseType = con.getResponseCode()/100;
        }
        if (responseType == 3)
            return con.getHeaderField("Location");
        return con.getURL().toString();

I am having several (conceptual and technical) problems with it:

Conceptual problem:

  • It works in most cases, but I don't quite understand how.
  • All methods of the 'con' instance are called AFTER the connection is opened (when 'con' is instantiated).
  • So how do they affect the actual result?
  • How come calling 'setInstanceFollowRedirects' affects the returned value of 'getHeaderField'?
  • Is there any point calling 'getResponseCode' over and over until the returned value is not 1xx?
  • Bottom line, my general question here: is there another request/response sent through the connection every time one of these methods is invoked?

Technical problem:

  • Sometimes the response-code is 3xx, but 'getHeaderField' does not return the "final" URL.
  • I tried calling my code with the returned value of 'getHeaderField' until the response-code was 2xx.
  • But in most other cases where the response-code is 3xx, 'getHeaderField' DOES return the "final" URL, and if I call my code with this URL then I get an empty string.

Can you please advise how to approach the two problems above in order to have a "100% proof" code for retrieving the "final" URL?

Please ignore cases where the response-code is 4xx or 5xx (or anything else other than 1xx / 2xx / 3xx for that matter).

Thanks

barak manos
  • What does your exception handling look like? Maybe the code quietly ignores an exception which could tell you more about a possible reason for the problems. Please also post the try..catch or throws part. – Daniel S. Dec 27 '13 at 18:46
  • Everything mentioned above refers to cases where no exception is thrown. The entire code is properly encapsulated with try/catch, returning "" upon any exception. But I'm not interested in solving exceptions in the scope of this question, since (as I said) the problem described occurs under the "normal execution path". – barak manos Dec 27 '13 at 19:03

2 Answers

2

Conceptual problems:

0.) Can one URLConnection or HttpURLConnection object be reused?

No, you cannot reuse such an object. Each instance performs a single request: you cannot use it to retrieve another URL, nor to fetch the same content twice (speaking on the network level).

If you want to fetch another URL or to fetch the same URL a second time, you have to call the openConnection() method of the URL class again to instantiate a new connection object.

1.) When is the URLConnection actually connected?

The method name openConnection() is misleading. It only instantiates the connection object. It does not do anything on the network level.

The interaction on the network level starts in this line, which implicitly connects the connection (= the TCP socket under the hood is opened and data is sent and received):

        int responseType = con.getResponseCode()/100;

Alternatively, you can use HttpURLConnection.connect() to explicitly connect the connection.
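
A minimal sketch of this lazy behaviour, assuming nothing is listening on the target address. The class name LazyConnectDemo, the helper failsOnlyOnFirstIo and the address http://127.0.0.1:1/ are illustrative and not part of the question. Opening and configuring the connection succeed without any network activity; a failure can only surface at getResponseCode():

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class LazyConnectDemo {

    // Hypothetical helper: returns true when opening and configuring the
    // connection succeed, but the first call that needs the network fails.
    static boolean failsOnlyOnFirstIo(String ref) {
        HttpURLConnection con;
        try {
            // Only creates the connection object; no socket is opened here.
            con = (HttpURLConnection) new URL(ref).openConnection();
        } catch (IOException e) {
            return false; // malformed URL or unsupported protocol
        }
        // Still purely local: these calls only record settings for the request.
        con.setInstanceFollowRedirects(false);
        con.setRequestProperty("User-Agent", "");
        con.setConnectTimeout(2000);
        try {
            con.getResponseCode(); // implicit connect: request sent, response read
            return false;          // something answered, so nothing failed
        } catch (IOException e) {
            return true;           // the network was touched only at this point
        } finally {
            con.disconnect();
        }
    }

    public static void main(String[] args) {
        // Assumption: port 1 on the loopback interface is closed.
        System.out.println(failsOnlyOnFirstIo("http://127.0.0.1:1/"));
    }
}
```

This is also why the setters in the question take effect even though they are called after openConnection(): the request has not been sent yet at that point.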

2.) How does setInstanceFollowRedirects work?

setInstanceFollowRedirects(true) causes the URLs to be fetched "under the hood" again and again until there is a non-redirect response. The response code of the non-redirect response is returned by your call to getResponseCode().

UPDATE:
Yes, this lets you write simple code if you do not want to handle the redirects yourself. You can simply enable redirect following and then read the final response from the location you are redirected to, as if no redirect had taken place.
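
One caveat: HttpURLConnection's automatic following does not switch protocols (for example from http to https), so a 3xx response can still reach the caller even with following enabled. If that matters, the chain can be followed by hand. The following is only a sketch under stated assumptions: the class FinalUrlResolver, the helper resolveLocation and the hop limit MAX_HOPS are made up for illustration, and error handling is left to the caller. It opens a new connection object for every hop and resolves relative Location values against the current URL:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

class FinalUrlResolver {

    // Assumption: ten hops is enough; the limit only guards against loops.
    static final int MAX_HOPS = 10;

    // Resolves a Location value (absolute or relative) against the current URL.
    static String resolveLocation(String base, String location) {
        try {
            return new URL(new URL(base), location).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad redirect target: " + location, e);
        }
    }

    // Follows redirects manually and returns the last URL that did not redirect.
    static String finalUrl(String ref) throws IOException {
        String current = ref;
        for (int hop = 0; hop < MAX_HOPS; hop++) {
            // A connection object serves one request, so open a new one per hop.
            HttpURLConnection con = (HttpURLConnection) new URL(current).openConnection();
            con.setInstanceFollowRedirects(false);
            try {
                int code = con.getResponseCode();
                String location = con.getHeaderField("Location");
                boolean redirect = code == 301 || code == 302 || code == 303
                        || code == 307 || code == 308;
                if (!redirect || location == null) {
                    return current; // final response (or a redirect without a target)
                }
                current = resolveLocation(current, location);
            } finally {
                con.disconnect();
            }
        }
        throw new IOException("Too many redirects, last URL: " + current);
    }
}
```

Calling FinalUrlResolver.finalUrl(ref) would then replace the single-connection code from the question. Redirects performed by JavaScript or meta refresh tags are not visible at this level.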

Daniel S.
  • So what you're really saying is, that I should do the exact opposite of what I'm doing? Call setInstanceFollowRedirects(true), then call getResponseCode(), which will return anything besides 3xx, and then, assuming that the response-code is 2xx, simply return con.getURL().toString()? – barak manos Dec 27 '13 at 18:43
  • @barakmanos, I've edited my answer to more specifically address your comment. – Daniel S. Dec 27 '13 at 18:56
  • Thank you Daniel S. So I am returning con.getURL().toString() as soon as the response-code is not 1xx (assuming that it has to be 2xx at that point). But I still have cases where the returned value is not the final location. Is there any chance that I need to continue polling the response-code until it is 2xx (instead of polling it until it is not 1xx)? In other words, is there any chance that 3xx is returned for some period of time? – barak manos Dec 27 '13 at 19:10
  • @barakmanos Btw, did you ever receive a 1xx response code? I've never seen it getting used. Is this some ajax stuff? What kind of thing do you write there? If it's not something extremely fancy, you probably don't need to care about the 1xx codes at all. – Daniel S. Dec 27 '13 at 19:14
  • I have never received 1xx, but I read about it in the HTTP standard, and initially I thought that was my problem so I added it. I'm working on the server side so no (client-side) AJAX. Just trying to find the landing-page of ads on the web. I am now polling the response-code as long as it's 1 or 3. Do you think it will help me retrieve the landing-page in 100% of all cases? – barak manos Dec 27 '13 at 19:17
  • @barakmanos Then remove any handling of 1xx, set to follow redirects, remove any handling of 3xx, remove the polling and sleeping altogether (you don't need any of this) and try it again. What problem do you have then? Post it as a new question on SO and link the two to one another. – Daniel S. Dec 27 '13 at 19:19
0

I would be more careful in evaluating the response code. Not every 3xx-code is automatically a kind of redirection. For example the code 304 just stands for "Not modified."

Look at the original definitions here.
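
The distinction can be made explicit with a small check. This is a sketch, assuming that 301, 302, 303, 307 and 308 are the codes to treat as redirects and that a redirect without a Location header cannot be followed. The class name RedirectCodes and the method isRedirect are illustrative:

```java
import java.net.HttpURLConnection;

class RedirectCodes {

    // Returns true only for 3xx codes that point to another location
    // and actually carry a Location header value.
    static boolean isRedirect(int code, String location) {
        if (location == null || location.isEmpty()) {
            return false; // nothing to follow
        }
        switch (code) {
            case HttpURLConnection.HTTP_MOVED_PERM: // 301
            case HttpURLConnection.HTTP_MOVED_TEMP: // 302
            case HttpURLConnection.HTTP_SEE_OTHER:  // 303
            case 307:                               // Temporary Redirect
            case 308:                               // Permanent Redirect
                return true;
            default:
                return false; // includes 304 Not Modified
        }
    }
}
```

With the code from the question, the test `responseType == 3` would become `RedirectCodes.isRedirect(con.getResponseCode(), con.getHeaderField("Location"))`.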

Meno Hochschild
  • This is a good point, but I would put this as a comment, as it does not, at least not clearly, address any of the OP's questions. – Daniel S. Dec 27 '13 at 18:36
  • I checked the 3xx issue, just didn't want to "overload" my question with tedious details. In any case, it is always 301 or 302, so the problem must be lying elsewhere. – barak manos Dec 27 '13 at 18:40
  • Okay, sometimes my answers are too short. Sorry for that. I agree, comment form would have been more appropriate. My contribution was meant especially with regard to this statement: 'Sometimes the response-code is 3xx, but 'getHeaderField' does not return the "final" URL.' – Meno Hochschild Dec 27 '13 at 18:43