
My code works in most cases, but it fails when the site redirects to a new URL. For example, http://www.oil-india.com/ redirects to http://www.oil-india.com/oilnew/ in the browser. With jsoup, the code below fails to retrieve the links when given the original URL.

doc = Jsoup.connect(url).timeout(0).userAgent(USER_AGENT).validateTLSCertificates(false).followRedirects(true).get();

Elements subLinks = doc.select("a[href]");
Frederic Klein
user2849678
  • Check the HTTP response code and branch on it: [link](http://stackoverflow.com/questions/6467848/how-to-get-http-response-code-for-a-url-in-java) – Many_question Jan 27 '17 at 08:44

1 Answer


If you print out the document, you will notice that the redirect is done with JavaScript rather than an HTTP redirect, which is why followRedirects(true) has no effect here:

[...]
window.location.href = '../oilnew/'; 
[...]

You could parse the script tag manually: look for window.location.href, check whether the assignment is triggered on load, and extract the target. Alternatively, use HtmlUnit to follow the redirect (though it is quite slow).
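The manual approach could look roughly like the sketch below. It is not part of the original answer and makes several assumptions: the class name `JsRedirectSketch`, the method `findRedirectTarget`, and the regular expression are all hypothetical, and the pattern only covers a plain `window.location.href = '...'` assignment with a quoted literal. It does not check whether the assignment actually runs on load, so treat a match as a hint rather than proof of a redirect. Only JDK classes are used.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JsRedirectSketch {

    // Assumed pattern: location.href = '...' with single or double quotes,
    // optionally prefixed by "window.". Real pages may need a looser pattern.
    private static final Pattern REDIRECT = Pattern.compile(
            "(?:window\\.)?location\\.href\\s*=\\s*(['\"])(.*?)\\1");

    /** Returns the absolute redirect target found in the HTML, if any. */
    static Optional<String> findRedirectTarget(String html, String baseUrl) {
        Matcher m = REDIRECT.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        // Resolve the (possibly relative) target against the page URL.
        URI resolved = URI.create(baseUrl).resolve(m.group(2));
        String path = resolved.getPath();
        if (path == null) {
            return Optional.of(resolved.toString());
        }
        // Browsers drop ".." segments that would climb above the root;
        // java.net.URI may keep them, so strip them here.
        while (path.startsWith("/../")) {
            path = path.substring(3);
        }
        try {
            return Optional.of(new URI(resolved.getScheme(), resolved.getAuthority(),
                    path, resolved.getQuery(), resolved.getFragment()).toString());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build URL from " + m.group(2), e);
        }
    }

    public static void main(String[] args) {
        String html = "<script>window.location.href = '../oilnew/';</script>";
        System.out.println(
                findRedirectTarget(html, "http://www.oil-india.com/").orElse("no redirect found"));
    }
}
```

With jsoup, the HTML string could come from doc.html() and the base URL from doc.location(); if a target is found, pass it to a second Jsoup.connect(...) call.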

Example Code

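import java.io.IOException;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36";
String url = "http://www.oil-india.com/";

// Silence HtmlUnit's verbose logging.
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);

// WebClient is AutoCloseable, so try-with-resources releases it when done.
try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setRedirectEnabled(true);

    // HtmlUnit executes the page's JavaScript; getUrl() is the URL after the redirect.
    url = webClient.getPage(url).getUrl().toString();
    // jsoup then fetches and parses the final URL.
    Document doc = Jsoup.connect(url).userAgent(userAgent).followRedirects(true).get();
    System.out.println(doc.toString());
} catch (FailingHttpStatusCodeException | IOException e) {
    e.printStackTrace();
}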

Output

<a href="#" class="close">Close</a>
<a href="default.aspx"><img src="oilindia-img/logo.jpg" alt="Oil India" style="height:95px;"></a>
 <a href="screenreader.aspx"><img src="oilindia-img/screen_reader_icon.png" style="vertical-align:middle;" alt="top"><span id="MenuBarTop_link_screenreader" class="link_screenreader">Screen Reader Access</span> </a>
<a href="javascript:decreaseFontSize();" class="toplink"> <img alt="orange color" src="oilindia-img/a-.png" id="Img1"> </a>
[...]
Frederic Klein