
I would like to access a webpage using both its IP address and its host name, so that I can skip DNS lookups by storing the IP addresses of domains in advance. With raw sockets, this is done by opening a socket to the IP address and sending a GET request, like this:

Socket s = new Socket([string_ip_address], 80);

Then transmitting: GET [file_name] HTTP/1.1\r\nHost: [some_name]\r\n\r\n
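The request text described above can be sketched as a small helper. This is only an illustration: the class name `RawHttpRequest` and the added `Connection: close` header are assumptions, not part of the original code. It shows the pieces that matter on the wire: an uppercase method, CRLF line endings, the Host header, and the blank line that ends the headers.

```java
final class RawHttpRequest {

    /**
     * Builds the text of a minimal HTTP/1.1 GET request.
     * "Connection: close" is an assumption added here so that the server
     * closes the socket after responding, which lets a simple read loop end.
     */
    static String get(String path, String host) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n"; // empty line marks the end of the headers
    }

    public static void main(String[] args) {
        // Print the request that would be written to the socket's output stream.
        System.out.print(get("/", "www.buzzfeed.com"));
    }
}
```

The returned string would be written as-is to the output stream of a socket opened to the stored IP address.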

But I would like to use Jsoup. The equivalent command to retrieve a page, say www.google.com, in Jsoup is:

Jsoup.connect("http://www.google.com").get();

But the provided site name must be the actual name, not the IP address (because, if my limited understanding is correct, many domains can reside at the same IP address). So I figured I might try to alter the request made by Jsoup to include both the site name and the IP address. Jsoup uses HttpURLConnection in its underlying code; here's a snippet from the Jsoup library itself, as found here: https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/helper/HttpConnection.java:

HttpURLConnection conn = (HttpURLConnection) req.url().openConnection();

conn.setRequestMethod(req.method().name());
conn.setInstanceFollowRedirects(false);
conn.setConnectTimeout(req.timeout());
conn.setReadTimeout(req.timeout());

if (conn instanceof HttpsURLConnection) {
    if (!req.validateTLSCertificates()) {
         initUnSecureTSL();
         ((HttpsURLConnection)conn).setSSLSocketFactory(sslSocketFactory);
         ((HttpsURLConnection)conn).setHostnameVerifier(getInsecureVerifier());
    }
}

if (req.method().hasBody())
    conn.setDoOutput(true);
if (req.cookies().size() > 0)
    conn.addRequestProperty("Cookie", getRequestCookieString(req));
for (Map.Entry<String, String> header : req.headers().entrySet()) {
    conn.addRequestProperty(header.getKey(), header.getValue());
}

I thought about writing something like this:

Jsoup.connect(ip).header("Host", host);

But this doesn't seem to work. Is there a known way to use an IP address plus a host name in Jsoup requests (to avoid DNS lookups), or some other way to skip the DNS lookup with Jsoup?

Thanks!

EDIT -

Just to be clear: using sockets with an IP address and a host name works. For example, fetching the main page of buzzfeed via its IP address in the following way:

Socket s = new Socket("23.34.229.118", 80);
BufferedReader reader = new BufferedReader(new InputStreamReader(s.getInputStream()));
PrintStream writer = new PrintStream(s.getOutputStream());
// Explicit CRLF line endings, plus the blank line that ends the headers.
writer.print("GET / HTTP/1.0\r\nHost: www.buzzfeed.com\r\n\r\n");
writer.flush();

String line;
while((line = reader.readLine()) != null)
{
    System.out.println(line);
}

s.close();

Works perfectly fine. But I am unable to access the page via

Jsoup.connect("http://23.34.229.118");

And I am quite sure that's because I need to specify the host somehow, if that's even possible. My attempt with

Jsoup.connect("http://23.34.229.118").header("Host", "buzzfeed.com"); 

failed and I got a 400 error.

amirkr
  • How about ```Jsoup.connect("http://" + "173.194.117.180");``` –  Dec 20 '15 at 08:34
  • @saka1029 That does not work. For example, the IP of www.buzzfeed.com is 23.34.229.118. You can create a connection by invoking Jsoup.connect("http://www.buzzfeed.com") but not by Jsoup.connect("http://23.34.229.118"). You get a 400 error. As I was saying, I suspect this is because the IP alone is not enough to identify the actual website. Any other ideas? :) – amirkr Dec 20 '15 at 08:48
  • Add "http://" before IP address. –  Dec 20 '15 at 08:51
  • @saka1029, yes, I mistakenly wrote that here without the "http://" prefix, but I did write this in my code. Jsoup.connect("http://23.34.229.118") does not work. – amirkr Dec 20 '15 at 08:53
  • I was not mistaken, it seems; stackoverflow just automatically deletes my "http" :) For clarity, Jsoup.connect("http://" + "23.34.229.118") does not work. 400 error. – amirkr Dec 20 '15 at 08:55
  • Before trying to mess with this within your program code, why don't you run a local caching DNS server? You will still have DNS lookups, but since they are local they should be fast. If local DNS lookup is a bottleneck, I doubt that Jsoup is the right tool for you. – luksch Dec 20 '15 at 09:14
  • @luksch That seems a bit above my abilities (at least within reasonable time frames). I don't really know how to do that. Jsoup is good for me since it actually works faster when maintaining ~400 threads all downloading at the same time (I've tried regular HttpUrlConnections too, it was slower). But someone suggested that I'm being restricted (by the internet provider? router? something else?) from conducting 400 DNS lookups at the same time. So I want to test it by storing ip values beforehand, then running 400 threads and downloading. Sorry if it's unclear, I do value the response. – amirkr Dec 20 '15 at 09:20
  • Try ```Jsoup.connect("http://" + "173.194.117.180");``` that is www.google.com. It works in my environment. But "23.34.229.118" does not work. This IP also does not work in Chrome. I think this server is blocking access by IP. –  Dec 20 '15 at 09:24
  • @saka1029 Yes, google does indeed work, but most other sites do not. It seems strange, because these sites do work when you connect via IP with a Socket object. So you can "get them" with an IP and a socket, but not with Jsoup? There must be a way to make Jsoup accept an IP AND a host name (the same way you'd send an HTTP request after connecting directly to an IP address and a port). – amirkr Dec 20 '15 at 09:34
  • I think it is not a problem with Jsoup. Web browsers like Chrome cannot access them by IP either. –  Dec 20 '15 at 09:40
  • @saka1029 It's not, these IP addresses (when providing a host) are accessible directly via sockets. See my edited question, if you want :) – amirkr Dec 20 '15 at 09:54
  • You should use socket. You read all text, and remove HTTP header (it ends with empty line) then ```Jsoup.parse(text)```. –  Dec 20 '15 at 10:14
  • @saka1029 but I don't want to use sockets. It adds a lot of complexity since the connection is "raw". Jsoup encapsulates all the difficulties, SSL stuff, bad connections, responses, whatnot, into a library which actually works faster in multithreaded environment. Dealing with the sockets in this scenario is a nightmare. – amirkr Dec 20 '15 at 10:23

1 Answer


I believe I have found the solution.

The following line needs to be added to the code, before the first HTTP connection is opened:

System.setProperty("sun.net.http.allowRestrictedHeaders", "true");

This is closely related to this question, since the implementation of Jsoup uses HttpURLConnection: Can I override the Host header where using java's HttpUrlConnection class?

Apparently, Java by default blocks changes to some headers, one of which is the Host header.
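To see the effect without depending on a real site, here is a minimal sketch using plain HttpURLConnection (which Jsoup uses underneath, as quoted in the question). Everything around the property is an assumption made for illustration: the class name `HostHeaderDemo`, the helper `hostSeenByServer`, and the throwaway local server on 127.0.0.1 that stands in for the stored IP address and simply echoes back whatever Host header it received.

```java
import com.sun.net.httpserver.HttpServer;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class HostHeaderDemo {

    /** Requests http://127.0.0.1:<port>/ with a custom Host header; returns the Host value the server saw. */
    static String hostSeenByServer(String hostHeader) {
        // Set before the JDK's HTTP client is first used, otherwise it may be ignored.
        System.setProperty("sun.net.http.allowRestrictedHeaders", "true");
        HttpServer server = null;
        try {
            // Hypothetical stand-in for the real site: echoes the received Host header.
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                String seen = exchange.getRequestHeaders().getFirst("Host");
                byte[] body = String.valueOf(seen).getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();

            // Connect by IP address, but name the site in the Host header.
            URL url = new URL("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
            conn.setRequestProperty("Host", hostHeader);
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(hostSeenByServer("www.buzzfeed.com"));
    }
}
```

With the property set, this should print the custom host name rather than 127.0.0.1 and the port. The same property is what lets `Jsoup.connect("http://23.34.229.118").header("Host", "www.buzzfeed.com")` send the Host header it was given.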

amirkr