
Consider the following URL: http://www.google.com/url?rct=j&sa=t&url=http://www.ksat.com/news/father-of-woman-killed-in-memorial-day-floods-testifies-for-better-flood-warnings&ct=ga&cd=CAIyHWU3NmVhMGQ0NWQ3MmRmY2I6Y29tOmVuOlVTOlJM&usg=AFQjCNE_8XwECqkmyPIMzcSxCDh2hP16wQ. When I pass this URL to jsoup, the HTML I get back is not the content of the target page. But when I open the same URL in a browser, it redirects to http://www.ksat.com/news/father-of-woman-killed-in-memorial-day-floods-testifies-for-better-flood-warnings.

When I pass that second URL to jsoup, I get the expected HTML content.

How can I get the same HTML content starting from the first URL?

I have tried several options, for example:

        // First request: jsoup follows HTTP-level (3xx) redirects on its own here.
        Response response = Jsoup.connect(url).followRedirects(true).timeout(timeOut * 1000).userAgent(userAgent).execute();
        int status = response.statusCode();
        // With followRedirects(true) this branch is normally not reached; it is a manual fallback.
        if (status == HttpURLConnection.HTTP_MOVED_TEMP || status == HttpURLConnection.HTTP_MOVED_PERM || status == HttpURLConnection.HTTP_SEE_OTHER) {
            redirectUrl = response.header("location");
            response = Jsoup.connect(redirectUrl).followRedirects(false).timeout(timeOut * 1000).userAgent(userAgent).execute();
        }
        Document doc = response.parse();

I also tried many user agents, .referrer("http://google.com"), and so on. I am currently using jsoup version 1.8.3.

din_oops

1 Answer


Google returns an HTML page with a JavaScript/META redirect:

<script>window.googleJavaScriptRedirect=1</script><script>var n={navigateTo:function(b,a,d){if(b!=a&&b.google){if(b.google.r){b.google.r=0;b.location.href=d;a.location.replace("about:blank");}}else{a.location.replace(d);}}};n.navigateTo(window.parent,window,"http://www.ksat.com/news/father-of-woman-killed-in-memorial-day-floods-testifies-for-better-flood-warnings");
</script><noscript><META http-equiv="refresh" content="0;URL='http://www.ksat.com/news/father-of-woman-killed-in-memorial-day-floods-testifies-for-better-flood-warnings'"></noscript>

That is different from an HTTP redirect (a 3xx status with a Location header), and since jsoup does not execute JavaScript, followRedirects(true) cannot help here.

However, you can parse this page to extract the real link. You can even do so without contacting Google at all, since the link is already present as the url parameter of the original URL.
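A minimal sketch of that second approach, using only the JDK: it pulls the `url` query parameter out of the Google link and decodes it. The class and method names (`GoogleRedirect`, `extractTarget`) are made up for illustration, and the code assumes the target is carried in a parameter literally named `url` and contains no unencoded `&` of its own.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class GoogleRedirect {

    /**
     * Returns the decoded value of the "url" query parameter,
     * or null if the given link has no such parameter.
     * (Hypothetical helper, not part of jsoup.)
     */
    public static String extractTarget(String googleUrl) {
        int q = googleUrl.indexOf('?');
        if (q < 0) {
            return null; // no query string at all
        }
        String query = googleUrl.substring(q + 1);
        int hash = query.indexOf('#');
        if (hash >= 0) {
            query = query.substring(0, hash); // drop any fragment
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("url")) {
                try {
                    // Handles percent-encoded targets; plain ones pass through unchanged.
                    return URLDecoder.decode(pair.substring(eq + 1), "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    // UTF-8 is guaranteed to exist on every JVM.
                    throw new IllegalStateException(e);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String googleUrl = "http://www.google.com/url?rct=j&sa=t"
                + "&url=http://www.ksat.com/news/father-of-woman-killed-in-memorial-day-floods-testifies-for-better-flood-warnings"
                + "&ct=ga";
        System.out.println(extractTarget(googleUrl));
    }
}
```

The returned string can then be passed to `Jsoup.connect(...)` as usual. If the parameter is missing, the method returns null, in which case falling back to parsing the META refresh tag from the returned page is an option.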

luksch
This SO answer provides a ready-made method for parsing URL parameters: http://stackoverflow.com/a/13592567/363573 – Stephan Dec 18 '15 at 09:02