
I'm trying to use JSoup to scrape some pages that are on a staging server. To view the pages on the staging server with a browser I need to be connected to a VPN.

I am connected to the VPN, but when I use JSoup to scrape the page the request keeps timing out. How can I make my program use the VPN connection? Or is there something else here I'm not thinking of?

Note: I also make use of HttpClient in another part of my program. Is there a way to configure the VPN/proxy once, when the program initialises, so that both JSoup and HttpClient use it?

Thanks

Peck3277
    If you have `HttpClient` running through the proxy, you can use it to download the page into a string and parse that (like solution #2 in my answer). – ollo Feb 05 '13 at 19:49

3 Answers


You can set the Java system properties for the proxy:

// for https URLs, also set https.proxyHost and https.proxyPort
System.setProperty("http.proxyHost", "<proxyip>"); // set proxy server
System.setProperty("http.proxyPort", "<proxyport>"); // set proxy port

Document doc = Jsoup.connect("http://your.url.here").get(); // Jsoup now connects via proxy
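
The "set it here too" remark for https can be sketched as a small helper that sets both pairs of properties in one place. This is only an illustration: `ProxySettings` and `apply` are hypothetical names, and it assumes the same proxy host and port serve both http and https traffic.

```java
final class ProxySettings {

    private ProxySettings() {
        // static helper, not meant to be instantiated
    }

    /** Sets the JVM-wide proxy properties for both http:// and https:// URLs. */
    static void apply(String host, int port) {
        String portText = Integer.toString(port);

        // Read by HttpURLConnection for plain http requests
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", portText);

        // Read by HttpURLConnection for https requests
        System.setProperty("https.proxyHost", host);
        System.setProperty("https.proxyPort", portText);
    }
}
```

Calling `ProxySettings.apply("<proxyip>", 1234)` once at startup, before the first `Jsoup.connect(...)`, should be enough for connections made through `HttpURLConnection`.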

Alternatively, download the page into a string first and parse it afterwards:

// needs: java.io.*, java.net.*, java.nio.charset.StandardCharsets, org.jsoup.Jsoup, org.jsoup.nodes.Document
final URL website = new URL("http://your.url.here"); // the page you want to fetch

// -- Set up the connection through the proxy
Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("<proxyserver>", 1234)); // proxy server and port
HttpURLConnection httpUrlConnection = (HttpURLConnection) website.openConnection(proxy);
httpUrlConnection.connect();

// -- Download the page into a buffer (assumes the page is UTF-8 encoded)
StringBuilder buffer = new StringBuilder();
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(httpUrlConnection.getInputStream(), StandardCharsets.UTF_8))) {
    String str;
    while ((str = br.readLine()) != null) {
        buffer.append(str).append('\n'); // readLine() strips the line break, so put it back
    }
}

// -- Parse the buffer with Jsoup; passing the base URI lets relative links resolve
Document doc = Jsoup.parse(buffer.toString(), website.toString());

You can use HttpClient for this solution as well.

ollo

As of jsoup 1.9 you can set the proxy on the connection itself: https://jsoup.org/apidocs/org/jsoup/Connection.html#proxy-java.net.Proxy-

Jsoup.connect("http://your.url.here").proxy("<proxy-host>", <proxy-port>).get();
Luís Soares
Kees de Kooter

To add to ollo's answer: if your proxy needs username/password authentication, register a default Authenticator as well.

final String authUser = "<username>";
final String authPassword = "<password>";
Authenticator.setDefault(
   new Authenticator() {
      @Override
      public PasswordAuthentication getPasswordAuthentication() {
         return new PasswordAuthentication(
               authUser, authPassword.toCharArray());
      }
   }
);

System.setProperty("http.proxyHost", "<yourproxyhost>");
System.setProperty("http.proxyPort", "<yourproxyport>");
System.setProperty("http.proxyUser", authUser);
System.setProperty("http.proxyPassword", authPassword);

Document doc = Jsoup.connect("http://your.url.here").get();
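
The authenticator above answers every authentication challenge with the proxy credentials, including challenges that come from ordinary web servers. A minimal sketch of one way to limit it to proxy challenges, assuming that is what you want; `ProxyAuth` and `install` are hypothetical names, not part of the original answer:

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

final class ProxyAuth {

    private ProxyAuth() {
        // static helper, not meant to be instantiated
    }

    /** Installs a JVM-wide Authenticator that only answers proxy challenges. */
    static void install(final String user, final String password) {
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                // Hand out the credentials only when a proxy asks for them,
                // not when an origin server sends its own challenge.
                if (getRequestorType() == RequestorType.PROXY) {
                    return new PasswordAuthentication(user, password.toCharArray());
                }
                return null; // no credentials for anything else
            }
        });
    }
}
```

Call `ProxyAuth.install("<username>", "<password>")` once at startup, before the first `Jsoup.connect(...)`. On recent JDKs, Basic authentication for https tunnelling through a proxy may be disabled by default (see the `jdk.http.auth.tunneling.disabledSchemes` networking property), so https URLs may need that setting adjusted.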
rtyusolf