
I want to scrape a webpage using Java's Jsoup library, but I am behind a corporate proxy that prevents me from connecting to the webpage. I researched the problem and now know that I have to address the proxy explicitly and authenticate myself to it. However, I am still not able to connect. I am trying to test my connection by simply retrieving the title of www.google.com using the following code:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Test {
    public static void main(String[] args) {
        System.out.println("1");
        try {
            System.setProperty("http.proxyHost", "myProxy");
            System.setProperty("http.proxyPort", "myPort");
            System.setProperty("http.proxyUser", "myUser");
            System.setProperty("http.proxyPassword", "myPassword");

            Document doc = Jsoup.connect("http://google.com").get();
            String title = doc.title();
            System.out.println(title);
        } catch (IOException e) {
            System.out.println(e);
        }
    }
}

The above code returns the following error:

org.jsoup.UnsupportedMimeTypeException: Unhandled content type. Must be text/*, application/xml, or application/xhtml+xml. Mimetype=application/x-ns-proxy-autoconfig, URL=http://google.com

This tells me that something was retrieved, but in a content type that Jsoup cannot process. So I adjusted "Test" to ignore the content type, in order to see what is retrieved, using the following code:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DemoII {
    public static void main(String[] args) {
        System.out.println("1");
        try {
            System.setProperty("http.proxyHost", "myProxy");
            System.setProperty("http.proxyPort", "myPort");
            System.setProperty("http.proxyUser", "myUser");
            System.setProperty("http.proxyPassword", "myPassword");

            String script = Jsoup.connect("http://google.com").ignoreContentType(true).execute().body();
            System.out.println(script);
        } catch (IOException e) {
            System.out.println(e);
        }
    }
}

It turns out that the "script" string contains source code returned by the proxy server. So I am making some connection to the proxy, but my request for www.google.com is not going through. Any ideas what I am doing wrong?

  • What does the proxy's response say? – MCL Apr 16 '14 at 09:32
  • sorry, i cant post the source code due to company policy. in general terms i get back java code that defines how the proxy operates. – user3182273 Apr 16 '14 at 09:39
  • Why Java code? Could it be that it returns a [Proxy auto-config](https://en.wikipedia.org/wiki/Proxy_auto-config)? And I'm sure it wouldn't be a problem to black out confidential information, would it? – MCL Apr 16 '14 at 09:46
  • What does `Document doc = Jsoup.connect("http://google.com").ignoreContentType(true).get();` yield? – StoopidDonut Apr 16 '14 at 09:48
  • @MCL yes it looks to me as if you are right and i get back a Proxy auto-config. – user3182273 Apr 16 '14 at 09:54
  • @PopoFibo if i use 'Document doc = Jsoup.connect("http://google.com").ignoreContentType(true).get()' doc will contain the Proxy auto-config. – user3182273 Apr 16 '14 at 09:58
  • @MCL hey thanks, i had no idea what this file did and after you told me what it does, i had a look inside and there was a proxy name that slightly differs from the one i used before, and now it works - Thanks – user3182273 Apr 16 '14 at 10:17
  • Great. Please post your solution as an answer and accept it, so that other people running into the same problem can benefit. – MCL Apr 16 '14 at 10:52
  • Also, it will be *cleaner* if you parse the PAC every time your program starts. This way, you won't have to worry about the "internet proxy" IP changing. Check out [this answer](http://stackoverflow.com/a/10325919/1282023). – MCL Apr 16 '14 at 11:23
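To illustrate the suggestion of reading the PAC at startup: properly evaluating a PAC file means running its JavaScript `FindProxyForURL` function, which the linked answer covers. The sketch below is a much simpler heuristic that only scans the PAC text for `PROXY host:port` directives, which is roughly what the OP did by hand. The class name `PacProxyScanner`, the sample PAC content, and the `example.com` host names are all made up for illustration, and the regex will miss proxies that the PAC script builds dynamically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists the "PROXY host:port" entries that appear in a PAC file's text.
final class PacProxyScanner {
    // Matches directives such as "PROXY proxy.example.com:8080" inside PAC return strings.
    private static final Pattern PROXY_DIRECTIVE =
            Pattern.compile("\\bPROXY\\s+([A-Za-z0-9.-]+):(\\d{1,5})");

    // Returns the distinct "host:port" strings in the order they first appear.
    static List<String> findProxies(String pacText) {
        List<String> proxies = new ArrayList<>();
        Matcher m = PROXY_DIRECTIVE.matcher(pacText);
        while (m.find()) {
            String hostPort = m.group(1) + ":" + m.group(2);
            if (!proxies.contains(hostPort)) {
                proxies.add(hostPort);
            }
        }
        return proxies;
    }

    public static void main(String[] args) {
        // Assumed sample; a real PAC would be the body fetched from the PAC URL.
        String samplePac =
                "function FindProxyForURL(url, host) {\n"
              + "  if (isPlainHostName(host)) return \"DIRECT\";\n"
              + "  return \"PROXY proxy-a.example.com:8080; PROXY proxy-b.example.com:8080; DIRECT\";\n"
              + "}\n";
        for (String proxy : findProxies(samplePac)) {
            System.out.println(proxy);
        }
    }
}
```

The first entry found could then be split into host and port and used for `http.proxyHost` and `http.proxyPort`, with the caveat that a PAC may route different URLs through different proxies.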

1 Answer

The OP found a solution (quoted from the comments on the question):

@MCL hey thanks, i had no idea what this file did and after you told me what it does, i had a look inside and there was a proxy name that slightly differs from the one i used before, and now it works - Thanks – user3182273
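A related point, not confirmed by the OP: `http.proxyUser` and `http.proxyPassword` are not among the networking properties the JDK documents, and `HttpURLConnection` normally obtains proxy credentials through `java.net.Authenticator` instead. The sketch below shows one way to set that up with only the JDK; the class name `ProxySetup`, the host `proxy.example.com`, the port, and the credentials are placeholders, and the real host would be the one found in the PAC file.

```java
import java.net.Authenticator;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.Proxy;

// Hypothetical helper: builds an explicit HTTP proxy plus an authenticator for it.
final class ProxySetup {
    // Describes the proxy without resolving the host name up front.
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    // Supplies credentials only when the proxy (not the target server) asks for them.
    static Authenticator proxyAuthenticator(String user, char[] password) {
        return new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                if (getRequestorType() == RequestorType.PROXY) {
                    return new PasswordAuthentication(user, password);
                }
                return null;
            }
        };
    }

    public static void main(String[] args) {
        // Placeholder values; substitute the host and port listed in the PAC file.
        Proxy proxy = httpProxy("proxy.example.com", 8080);
        Authenticator.setDefault(proxyAuthenticator("myUser", "myPassword".toCharArray()));
        System.out.println(proxy.type() + " " + proxy.address());
        // With Jsoup 1.9.1 or later on the classpath, the proxy can be passed per request:
        // Document doc = Jsoup.connect("http://google.com").proxy(proxy).get();
    }
}
```

Passing the proxy per request keeps the setting local to the scraper, whereas the system properties in the question apply to the whole JVM. Whether the authenticator is needed at all depends on how the corporate proxy authenticates.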

Community
Stephan