I want to scrape a webpage using Java's Jsoup library, but I am behind a corporate proxy that prevents me from connecting to the page. I researched the problem and now know that I have to address the proxy explicitly and authenticate myself to it. However, I am still not able to connect to the webpage. I am testing my connection by simply retrieving the title of www.google.com with the following code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Test {
    public static void main(String[] args) {
        System.out.println("1");
        try {
            System.setProperty("http.proxyHost", "myProxy");
            System.setProperty("http.proxyPort", "myPort");
            System.setProperty("http.proxyUser", "myUser");
            System.setProperty("http.proxyPassword", "myPassword");
            Document doc = Jsoup.connect("http://google.com").get();
            String title = doc.title();
            System.out.println(title);
        } catch (IOException e) {
            System.out.println(e);
        }
    }
}
The above code returns the following error:
org.jsoup.UnsupportedMimeTypeException: Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml. Mimetype=application/x-ns-proxy-autoconfig, URL=http://google.com
This tells me that something was retrieved, but with a content type that Jsoup cannot process. To see what is actually being retrieved, I adjusted "Test" to ignore the content type, using the following code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DemoII {
    public static void main(String[] args) {
        System.out.println("1");
        try {
            System.setProperty("http.proxyHost", "myProxy");
            System.setProperty("http.proxyPort", "myPort");
            System.setProperty("http.proxyUser", "myUser");
            System.setProperty("http.proxyPassword", "myPassword");
            String script = Jsoup.connect("http://google.com").ignoreContentType(true).execute().body();
            System.out.println(script);
        } catch (IOException e) {
            System.out.println(e);
        }
    }
}
It turns out that the "script" string contains source code served by the proxy server itself. So I am making some kind of connection to the proxy, but my request for www.google.com is not going through. Any ideas what I am doing wrong?
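For reference, here is a variant I am considering, as a rough sketch only. The MIME type in the error, application/x-ns-proxy-autoconfig, is the one used for proxy auto-config (PAC) files, so the host I configured may be serving a PAC script rather than acting as the proxy itself. The sketch also assumes that the proxy credentials have to be supplied through java.net.Authenticator instead of the http.proxyUser and http.proxyPassword properties. The class name ProxySketch, the method configureProxy, the host proxy.example.com, and port 8080 are placeholders I made up, not my real settings.

```java
import java.net.Authenticator;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.Proxy;

public class ProxySketch {

    // Hypothetical helper: registers proxy credentials and returns an explicit
    // HTTP proxy object. All argument values used below are placeholders.
    static Proxy configureProxy(String host, int port, String user, String password) {
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                // Only answer challenges issued by a proxy, not by the target site.
                if (getRequestorType() == RequestorType.PROXY) {
                    return new PasswordAuthentication(user, password.toCharArray());
                }
                return null;
            }
        });
        // createUnresolved defers the DNS lookup until a connection is opened.
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    public static void main(String[] args) {
        // Assumed values: the real proxy host and port would have to be read
        // from the PAC script or obtained from the network administrators.
        Proxy proxy = configureProxy("proxy.example.com", 8080, "myUser", "myPassword");
        System.out.println(proxy.type() + " " + proxy.address());
    }
}
```

If this is the right direction, the returned Proxy could presumably be handed to Jsoup with Jsoup.connect("http://google.com").proxy(proxy).get(), so that the request no longer depends on the system properties.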