45

I am new to Java, and my first task is to parse about 10,000 URLs and extract some information from them. I am using Jsoup for this, and it works fine.

But now I want to add proxy support to it. The proxies have a username and password too.

halfer
Himanshu
  • Hmm, have you tried using HtmlUnit instead? That should be up to the task – raven Sep 20 '11 at 09:16
  • http://stackoverflow.com/questions/120797/how-do-i-set-the-proxy-to-be-used-by-the-jvm – swanliu Sep 20 '11 at 09:18
  • Yes, I have tried it, but I still think Jsoup fits my requirements better. What I am confused about is how to work efficiently with proxies using Jsoup. – Himanshu Sep 20 '11 at 09:50

7 Answers

70

You can set the proxy through JVM-wide system properties:

System.setProperty("http.proxyHost", "192.168.5.1");
System.setProperty("http.proxyPort", "1080");
Document doc = Jsoup.connect("http://www.google.com").get();
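
These properties apply to the whole JVM, so every later connection keeps using the same proxy until they are cleared (a comment on this answer raises the same point). A minimal sketch of one way to limit that, assuming a hypothetical helper class `ProxyScope` that is not part of Jsoup or the JDK: it sets the two properties when created and restores the previous values when closed.

```java
/**
 * Hypothetical helper: sets http.proxyHost/http.proxyPort for the duration
 * of a try-with-resources block and restores the previous values afterwards.
 */
final class ProxyScope implements AutoCloseable {
    private final String previousHost = System.getProperty("http.proxyHost");
    private final String previousPort = System.getProperty("http.proxyPort");

    public ProxyScope(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", String.valueOf(port));
    }

    @Override
    public void close() {
        restore("http.proxyHost", previousHost);
        restore("http.proxyPort", previousPort);
    }

    private static void restore(String key, String previousValue) {
        if (previousValue == null) {
            // System.setProperty(key, null) would throw NullPointerException
            System.clearProperty(key);
        } else {
            System.setProperty(key, previousValue);
        }
    }
}
```

The call above could then be wrapped as `try (ProxyScope scope = new ProxyScope("192.168.5.1", 1080)) { Document doc = Jsoup.connect("http://www.google.com").get(); }`. The properties are still global while the scope is open, so this does not help when several threads need different proxies at the same time.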
Stephan
Yusuf Ismail Oktay
  • 1
    And don't forget to clear these properties after the call (for example with System.clearProperty); otherwise other calls that don't need the proxy will be very slow. – Dejell Dec 24 '14 at 09:35
  • If you use `Jsoup.connect("www.google.com").get()` now, it gives a MalformedURLException. – Aditya K Dec 01 '15 at 06:45
  • 10
    I couldn't seem to get this setting to work. Jsoup.connect would complete successfully regardless of the IP I set the proxyHost to. – vjuliano Jan 15 '16 at 20:33
  • 4
    @jln646v The following answer describes how to setup Jsoup with a proxy: http://stackoverflow.com/a/34943161/363573. – Stephan Aug 02 '16 at 10:10
  • 2
    I agree with @jln646v. Better to use `Jsoup.proxy(proxy)`. – jpllosa Aug 30 '16 at 03:35
  • This is the only solution if you need to connect over HTTPS. Just replace "http" with "https". +1. The Proxy class doesn't support Proxy.Type.HTTPS, so the answers that suggest that approach aren't feasible if you need to connect over HTTPS. – Luke Nov 29 '17 at 13:27
53

Jsoup 1.9.1 and above: (recommended approach)

// Fetch url with proxy
Document doc = Jsoup //
               .connect("http://www.example.com/") //
               .proxy("127.0.0.1", 8080) // sets an HTTP proxy
               .userAgent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2") //
               .header("Content-Language", "en-US") //
               .get();

You may also use the overload Connection#proxy(Proxy), which takes a java.net.Proxy instance (see below).

Before Jsoup 1.9.1: (verbose approach)

// Setup proxy
Proxy proxy = new Proxy(                                      //
        Proxy.Type.HTTP,                                      //
        InetSocketAddress.createUnresolved("127.0.0.1", 8080) //
);

// Fetch url with proxy
Document doc = Jsoup //
               .connect("http://www.example.com/") //
               .proxy(proxy) //
               .userAgent("Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2") //
               .header("Content-Language", "en-US") //
               .get();
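
Both variants above hard-code the proxy address. When the proxies come from a list (for example one entry per line in a file), a small parser can turn each entry into the `Proxy` object that `proxy(Proxy)` expects. A minimal sketch, assuming entries of the hypothetical form `scheme://host:port` with `http` or `socks` as the scheme; `ProxyParser` is an illustrative name, not a Jsoup class.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper that turns "scheme://host:port" into a java.net.Proxy. */
final class ProxyParser {
    public static Proxy parse(String spec) {
        URI uri = URI.create(spec); // throws IllegalArgumentException on bad syntax
        if (uri.getScheme() == null || uri.getHost() == null || uri.getPort() == -1) {
            throw new IllegalArgumentException("Expected scheme://host:port but got: " + spec);
        }
        Proxy.Type type;
        switch (uri.getScheme().toLowerCase(Locale.ROOT)) {
            case "http":
                type = Proxy.Type.HTTP;
                break;
            case "socks":
                type = Proxy.Type.SOCKS;
                break;
            default:
                throw new IllegalArgumentException("Unsupported proxy scheme: " + uri.getScheme());
        }
        // createUnresolved avoids a DNS lookup at construction time
        return new Proxy(type, InetSocketAddress.createUnresolved(uri.getHost(), uri.getPort()));
    }
}
```

The result can be passed straight to the chain above, for example `.proxy(ProxyParser.parse("http://127.0.0.1:8080"))`.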

Stephan
  • Hi, I'm using Jsoup 1.9.1. I tried both approaches from the answer, with no result. I'm working behind a proxy, but I can access the URL passed to the connect method. I get the same error every time: `java.net.ConnectException: Connection refused: connect` – Nakul Sharma May 02 '19 at 12:31
  • I tried the approach mentioned by Alex too https://stackoverflow.com/a/27819164/4941819 It didn't help either. – Nakul Sharma May 02 '19 at 12:33
  • @NakulSharma The error `java.net.ConnectException: Connection refused: connect` may indicate that the target host isn't listening or refuses some clients' connections. – Stephan Jun 27 '20 at 04:17
40

You don't have to fetch the webpage through Jsoup. Here is my solution, though it may not be the best.

// Needs java.io.*, java.net.*, org.jsoup.Jsoup and org.jsoup.nodes.Document
URL url = new URL("http://www.example.com/");
Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("127.0.0.1", 8080)); // or whatever your proxy is
HttpURLConnection uc = (HttpURLConnection) url.openConnection(proxy);

uc.connect();

// Read the whole response; try-with-resources closes the reader when done
StringBuilder tmp = new StringBuilder();
try (BufferedReader in = new BufferedReader(new InputStreamReader(uc.getInputStream()))) {
    String line;
    while ((line = in.readLine()) != null) {
        tmp.append(line).append('\n'); // keep the line breaks that readLine() strips
    }
}

Document doc = Jsoup.parse(tmp.toString());

And there it is. This gets the source of the html page through a proxy and then parses it with Jsoup.
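
One detail to watch with this approach: `new InputStreamReader(stream)` without a charset decodes with the platform default, which may not match the page. A minimal sketch of a reader that takes the charset explicitly; `ResponseReader` is a hypothetical helper name, and which charset to pass (UTF-8 in the usage note) is an assumption about the target site.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

/** Hypothetical helper that drains a stream into a String using a given charset. */
final class ResponseReader {
    public static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream)
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[8192];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, read); // copies characters as-is, line breaks included
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

With the connection from above this would be used as `Document doc = Jsoup.parse(ResponseReader.readAll(uc.getInputStream(), StandardCharsets.UTF_8));`.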

Ryan
  • Good solution. Is there any way I can use NIO-based fetching after uc.connect? – Develop4Life Jun 26 '15 at 11:56
  • 1
    For the casual reader, it's possible to pass Jsoup the inputstream of the HttpURLConnection directly. See the following answer for details: http://stackoverflow.com/a/42085445/363573 – Stephan Feb 07 '17 at 08:55
  • Can you use cookies with this method, for example to download a page that requires a login and password? I think Jsoup should always be used. – user25 Mar 31 '18 at 04:23
6

You might like to add this before running the program

final String authUser = "USERNAME";
final String authPassword = "PASSWORD";

Authenticator.setDefault(new Authenticator() {
    @Override
    public PasswordAuthentication getPasswordAuthentication() {
        return new PasswordAuthentication(authUser, authPassword.toCharArray());
    }
});

..

System.setProperty("http.proxyHost", "192.168.5.1");
System.setProperty("http.proxyPort", "1080");
..
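
The default `Authenticator` is consulted for any authentication challenge, not only the proxy's, so the snippet above would also offer the proxy credentials to a web server that asks for a password. A minimal sketch of a variant that answers proxy challenges only; `ProxyOnlyAuthenticator` is a hypothetical class name, and the credentials are placeholders.

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

/** Hypothetical Authenticator that only supplies credentials to proxies. */
final class ProxyOnlyAuthenticator extends Authenticator {
    private final String user;
    private final char[] password;

    public ProxyOnlyAuthenticator(String user, String password) {
        this.user = user;
        this.password = password.toCharArray();
    }

    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        // RequestorType.SERVER means the origin server is asking; give it nothing
        if (getRequestorType() != RequestorType.PROXY) {
            return null;
        }
        return new PasswordAuthentication(user, password);
    }
}
```

It is installed the same way, with `Authenticator.setDefault(new ProxyOnlyAuthenticator("USERNAME", "PASSWORD"));`. As far as I know, JDK 8u111 and later disable Basic authentication for HTTPS tunnelling through a proxy by default, controlled by the `jdk.http.auth.tunneling.disabledSchemes` system property, so that may need checking if HTTPS requests still fail with a 407.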
bitbyter
5
System.setProperty("http.proxyHost", "192.168.5.1");
System.setProperty("http.proxyPort", "1080");
Document doc = Jsoup.connect("http://www.google.com").get();

This is the wrong solution, because parsing jobs are usually multithreaded and we usually need to switch between proxies. System properties are JVM-wide, so this code sets a single proxy for all threads. It is therefore better not to use Jsoup.Connection.
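
If the goal is one proxy per request rather than one per JVM, a possible alternative is to give each request its own `java.net.Proxy`, which is what `url.openConnection(proxy)` accepts. A minimal sketch of a thread-safe round-robin pool; `RotatingProxyPool` is a hypothetical helper, and the addresses in the usage note are placeholders.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical round-robin pool; next() is safe to call from several threads. */
final class RotatingProxyPool {
    private final List<Proxy> proxies;
    private final AtomicInteger counter = new AtomicInteger();

    public RotatingProxyPool(List<Proxy> proxies) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("At least one proxy is required");
        }
        this.proxies = new ArrayList<>(proxies); // defensive copy
    }

    /** Returns the next proxy in turn, wrapping around at the end of the list. */
    public Proxy next() {
        // floorMod keeps the index non-negative even after the counter overflows
        int index = Math.floorMod(counter.getAndIncrement(), proxies.size());
        return proxies.get(index);
    }

    /** Convenience factory for an HTTP proxy entry. */
    public static Proxy http(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }
}
```

Each worker thread would then call something like `url.openConnection(pool.next())` for its own request, so no shared system property is involved.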

Alex Shwarc
  • For using Jsoup.Connection, see the following answer https://stackoverflow.com/a/34943161/363573 – Stephan Mar 25 '19 at 15:11
3

Jsoup has supported proxies since v1.9.1. The Connection class has the following methods:

  • proxy(Proxy p)
  • proxy(String host, int port)

You can use them like this:

Jsoup.connect("...url...").proxy("127.0.0.1", 8080);

If you need authentication, you can use the Authenticator approach mentioned by @Navneet Swaminathan or simply set system properties:

System.setProperty("http.proxyUser", "username");
System.setProperty("http.proxyPassword", "password");

or

System.setProperty("https.proxyUser", "username");
System.setProperty("https.proxyPassword", "password");
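
A caveat on the property route: as far as I know, the JDK's built-in HTTP client does not read `http.proxyUser` or `http.proxyPassword` on its own, so setting them may have no effect unless something picks them up. A minimal sketch that combines both ideas, assuming a hypothetical `SystemPropertyProxyAuthenticator` class that looks the two `http.*` properties up whenever a proxy asks for credentials.

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

/** Hypothetical Authenticator that takes proxy credentials from system properties. */
final class SystemPropertyProxyAuthenticator extends Authenticator {
    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        // Only answer challenges that come from a proxy
        if (getRequestorType() != RequestorType.PROXY) {
            return null;
        }
        String user = System.getProperty("http.proxyUser");
        String password = System.getProperty("http.proxyPassword");
        if (user == null || password == null) {
            return null; // nothing configured, let the request fail normally
        }
        return new PasswordAuthentication(user, password.toCharArray());
    }
}
```

After `Authenticator.setDefault(new SystemPropertyProxyAuthenticator());` the `System.setProperty` calls above supply the credentials for proxy challenges.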
juzraai
1

Try this code instead:

URL url = new URL("http://www.example.com/");
Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("127.0.0.1", 8080)); // or whatever your proxy is

HttpURLConnection uc = (HttpURLConnection)url.openConnection(proxy);
uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
uc.setRequestProperty("Content-Language", "en-US");
uc.setRequestMethod("GET");
uc.connect();

Document doc = Jsoup.parse(uc.getInputStream());
Stephan