I am writing a spider program in Java and have run into some trouble handling URL redirection. There are two kinds of URL redirection I have run into so far. The first kind uses an HTTP 3xx response code, which I can handle by following this answer.
The second kind is where the server returns HTTP response code 200 with a page that contains only some JavaScript code like this:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<script>
function detectmob() {
var u=(document.URL);
if( navigator.userAgent.match(/Android/i) || some other browser...){
window.location.href="web/mobile/index.php";
} else {
window.location.href="web/desktop/index.php";
}
}
detectmob();
</script>
</head>
<body></body></html>
If the original URL is http://example.com, then it will automatically redirect to http://example.com/web/desktop/index.php if I am using a desktop web browser with JavaScript enabled.
However, my spider checks HttpURLConnection#getResponseCode() to see whether it has reached the final URL (HTTP response code 200), and uses URLConnection#getHeaderField() to read the Location field if an HTTP 3xx response code is received. The following is the relevant code snippet from my spider:
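import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public String getFinalUrl(String originalUrl) {
    try {
        HttpURLConnection hCon = (HttpURLConnection) new URL(originalUrl).openConnection();
        // Handle redirects manually so that each hop can be inspected.
        hCon.setInstanceFollowRedirects(false);
        int code = hCon.getResponseCode();
        if (code == HttpURLConnection.HTTP_MOVED_PERM
                || code == HttpURLConnection.HTTP_MOVED_TEMP) {
            String location = hCon.getHeaderField("Location");
            if (location != null) {
                // Location may be relative, so resolve it against the current URL.
                String next = new URL(new URL(originalUrl), location).toString();
                System.out.println("redirected url: " + next);
                return getFinalUrl(next);
            }
        }
    } catch (IOException ex) {
        System.err.println(ex.toString());
    }
    // Anything other than a 301/302 with a Location header is treated as final.
    return originalUrl;
}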
So fetching the above page yields HTTP response code 200, and my spider simply assumes there is no further redirection and starts parsing the page, which contains no text content at all.
I have googled this issue a bit and apparently javax.script is somehow related, but I have no idea how to make it work. How can I program my spider so that it is able to get the correct URL?
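For reference, one possible stopgap that does not evaluate any JavaScript is to scan the body of a 200 response for string literals assigned to window.location.href and resolve each one against the page URL. The following is only a rough sketch under that assumption: it works only when the targets are plain string literals as in the page above, it cannot tell which branch a real browser would take, and the names JsRedirectScanner and findCandidates are made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
public class JsRedirectScanner {
    // Matches assignments such as window.location.href="target" or
    // window.location = 'target'. Comparisons (==) do not match.
    private static final Pattern HREF_ASSIGNMENT = Pattern.compile(
            "window\\.location(?:\\.href)?\\s*=\\s*([\"'])(.*?)\\1");

    // Returns every literal redirect target found in the HTML, resolved
    // against pageUrl, in the order the targets appear in the source.
    public static List<String> findCandidates(String html, String pageUrl)
            throws MalformedURLException {
        URL base = new URL(pageUrl);
        List<String> candidates = new ArrayList<>();
        Matcher m = HREF_ASSIGNMENT.matcher(html);
        while (m.find()) {
            // group(2) is the text between the quotes; it may be relative.
            candidates.add(new URL(base, m.group(2)).toString());
        }
        return candidates;
    }
}
```

For the page above, a scan like this would report both the mobile and the desktop target as candidates, so the spider would still need some rule for choosing between them (for example, following each one). Redirects that are built dynamically would be missed entirely, which is where actually executing the script would come in.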