I am trying to write an alert system that periodically scrapes a complaints board site to look for any complaints about my product. I am using Jsoup for this. Below is the code fragment that gives me an error.

doc = Jsoup.connect(finalUrl).timeout(10 * 1000).get();

This gives me the following error:

java.net.SocketException: Unexpected end of file from server

When I copy and paste the same finalUrl string into the browser, it works. I then tried a simple URL connection:

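            BufferedReader br = null;
            try {
                URL a = new URL(finalUrl);
                URLConnection conn = a.openConnection();

                // open the stream and put it into BufferedReader
                br = new BufferedReader(new InputStreamReader(
                        conn.getInputStream()));
                // read the whole response; br.toString() would not return the page content
                StringBuilder html = new StringBuilder();
                String line;
                while ((line = br.readLine()) != null) {
                    html.append(line).append('\n');
                }
                doc = Jsoup.parse(html.toString());
            } catch (IOException e) {
                e.printStackTrace();
            }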

But as it turned out, the connection itself fails, so the reader is never created (br is still null). Now the question is: why does the same string, when copied and pasted into a browser, open the site without any error?

Full stacktrace is as below:

java.net.SocketException: Unexpected end of file from server
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:774)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
    at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:771)
    at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
    at ComplaintsBoardScraper.main(ComplaintsBoardScraper.java:46)
ollo
rishi
  • Does your url start with `http://` and does your server allow connections to port `80`? – ollo Mar 11 '13 at 09:45
  • @ollo yes the URL starts with http://. The server is a remote server that is not in my control. Although when I try "nc" command on the server, it says: Connection to complaintsboard.com 80 port [tcp/http] succeeded! – rishi Mar 11 '13 at 14:02
  • Do you have any unescaped characters in the URL, or is internet access blocked for your application? Does the URL redirect to another one? – ollo Mar 11 '13 at 14:14
  • @ollo No. URL seems to be formed properly. I sysout the URL before sending it to Jsoup. When I copy paste the same URL from the console to the browser, it opens the page on the same url without redirect. URL I am trying is http://www.complaintsboard.com/?search=justanswer.com&complaints=Complaints – rishi Mar 11 '13 at 14:40
  • btw. +1 for that question. – ollo Mar 11 '13 at 15:21
  • See https://stackoverflow.com/questions/13670692/403-forbidden-with-java-but-not-web-browser – Raedwald Jul 26 '18 at 12:29

1 Answer

That one was tricky! :-)

The server blocks all requests which don't have a proper user agent. And that’s why you succeeded with your browser but failed with Java.

Fortunately, changing the user agent is easy in jsoup:

final String url = "http://www.complaintsboard.com/?search=justanswer.com&complaints=Complaints";
final String userAgent = "Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.7.3) Gecko/20040924 Epiphany/1.4.4 (Ubuntu)";

Document doc = Jsoup.connect(url) // you get a 'Connection' object here
                        .userAgent(userAgent) // ! set the user agent
                        .timeout(10 * 1000) // set timeout
                        .get(); // execute GET request

I've taken the first user agent I found … I guess you can use any valid one instead too.
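The plain `URLConnection` attempt from the question can be adjusted in the same way. The following is only a minimal sketch, assuming Java 9 or later (for `readAllBytes`); the class name `UserAgentFetch` and the helper methods `open` and `fetch` are made up for illustration, and the user agent string is simply the one used above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of jsoup or the JDK.
class UserAgentFetch {
    // Any browser-like value should do; this one is copied from the jsoup example.
    static final String USER_AGENT =
            "Mozilla/5.0 (X11; U; Linux i586; en-US; rv:1.7.3) Gecko/20040924 Epiphany/1.4.4 (Ubuntu)";

    // Prepares a connection with the given User-Agent header; nothing is sent yet.
    static HttpURLConnection open(String url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("User-Agent", userAgent); // must be set before connecting
        conn.setConnectTimeout(10 * 1000);
        conn.setReadTimeout(10 * 1000);
        return conn;
    }

    // Executes the GET request and returns the response body, decoded as UTF-8 (an assumption).
    static String fetch(String url, String userAgent) throws IOException {
        HttpURLConnection conn = open(url, userAgent);
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java UserAgentFetch.java <url>");
            return;
        }
        System.out.println(fetch(args[0], USER_AGENT));
    }
}
```

The returned string can then be handed to `Jsoup.parse(html, url)` if you still want a `Document`, although `Jsoup.connect(url).userAgent(...)` is shorter when jsoup is already on the classpath.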

ollo
  • I was thinking along similar lines but never could have gotten to the answer this quickly. Thanks a ton for that. – rishi Mar 11 '13 at 15:54
  • I've been thinking of defaulting the user agent in jsoup to something that looks like a browser, as this catches so many people. Any thoughts on pros/cons of that? – Jonathan Hedley Mar 14 '13 at 23:37
  • That’s a good idea, but it has one drawback: some websites use the user agent to decide which version of the site is shown to the user (maybe a “desktop” version for a browser (Firefox etc.) and a “mobile” version for smartphones (Android etc.)). If you set a “desktop” user agent, you’ll solve problems like the one asked here. On the other hand, anyone expecting a “mobile” version will get the “desktop” version too. – ollo Mar 15 '13 at 16:17
  • Here are some ideas: **a.)** create a user agent from the local environment settings, **b.)** create some defaults, like one for a desktop PC and one (or more) for mobile devices; you can select the proper one by reading the environment of the OS in use, **c.)** create a default one for desktop and one for mobile devices and let the user decide which one he needs ( `Jsoup.connect("url here").setDesktopUserAgent().get()`?) Btw., if possible, it might be useful to add a small hint about this topic to the documentation. I guess there are more users who run into this. – ollo Mar 15 '13 at 16:19
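
Ideas (a) and (c) from the last comment could be sketched on the caller's side without any change to jsoup. Everything here is hypothetical: the `UserAgents` class, the `Profile` enum and `fromEnvironment()` are illustrative names, and the two preset strings are merely example browser-style values, not defaults shipped by any library.

```java
// Hypothetical helper; jsoup itself does not provide these presets.
class UserAgents {
    // Idea (c): fixed presets the caller picks from explicitly.
    enum Profile {
        DESKTOP("Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"),
        MOBILE("Mozilla/5.0 (Android 13; Mobile; rv:109.0) Gecko/115.0 Firefox/115.0");

        final String value;

        Profile(String value) {
            this.value = value;
        }
    }

    // Idea (a): build a user agent from the local environment settings.
    // The format is invented for this sketch; a strict server may still reject it.
    static String fromEnvironment() {
        return "Mozilla/5.0 ("
                + System.getProperty("os.name") + " "
                + System.getProperty("os.arch") + ") Java/"
                + System.getProperty("java.version");
    }

    public static void main(String[] args) {
        System.out.println(Profile.DESKTOP.value);
        System.out.println(Profile.MOBILE.value);
        System.out.println(fromEnvironment());
    }
}
```

With jsoup, a preset would then be passed through the existing `userAgent` method, for example `Jsoup.connect(url).userAgent(UserAgents.Profile.MOBILE.value).get()`, which keeps the desktop or mobile choice in the caller's hands.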