
I'm building a Java application that downloads an HTML page from a website and saves the file to my local system. I can access the web page's URL manually via a browser, but when I try to access the same URL from my Java program, the server returns a 503 error. Here's the scenario:

sample URL = http://content.somesite.com/demo/somepage.asp

I can open the above URL in a browser, but the Java code below fails to download the page:

// sourceUrl is a java.net.URL pointing at the page to download
StringBuffer data = new StringBuffer();
BufferedReader br = null;
try {
    br = new BufferedReader(new InputStreamReader(sourceUrl.openStream()));
    String inputLine;
    while ((inputLine = br.readLine()) != null) {
        data.append(inputLine);
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    // guard against a NullPointerException if openStream() threw
    if (br != null) {
        try {
            br.close();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }
}

So, my questions are:

  1. Am I doing anything wrong here?

  2. Is there a way for a server to block requests from programs/bots and allow only requests coming from browsers?

Veera
  • As far as question #2 goes, the server could be configured to deny based on the `User-Agent` header or a missing `Referer` header. – ZoogieZork Jan 14 '10 at 07:19
  • @ZoogieZork: If that's what it's doing, it's misbehaving. 5xx errors are meant to be used for internal server problems. Bot denial should return 4xx errors. – skaffman Jan 14 '10 at 08:20
  • There are a lot of possible causes I can think of (I personally don't think that it's caused by "wrong" user-agent, it would rather have returned a 4xx error). If you dare to post the actual URL in question, then we may provide a better answer. – BalusC Jan 14 '10 at 21:20
  • Hi friends, thank you for all your responses. I found out what was causing the above error: I was running the code from my office system, which is behind a proxy, so the code failed to fetch the data because I hadn't set the proxy in my Java code (see the sketch below). When I ran the same code on my home system, which isn't behind any proxy, it ran without a glitch. – Veera Jan 15 '10 at 07:23
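
For reference, here is a minimal sketch of the fix Veera describes: pointing the JVM's HTTP stack at a proxy via the standard `http.proxyHost`/`http.proxyPort` system properties before opening the connection. The proxy host and port below are placeholder values.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class ProxiedDownload {
    public static void main(String[] args) throws Exception {
        // Placeholder proxy settings; substitute your network's actual values.
        System.setProperty("http.proxyHost", "proxy.example.com");
        System.setProperty("http.proxyPort", "8080");

        URL sourceUrl = new URL("http://content.somesite.com/demo/somepage.asp");
        BufferedReader br = new BufferedReader(
                new InputStreamReader(sourceUrl.openStream()));
        String line;
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
        br.close();
    }
}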

1 Answer


You may want to try setting the `User-Agent` and `Referer` HTTP headers to something like what a normal web browser would send.
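
For instance, with `java.net.HttpURLConnection` you can set those headers explicitly before reading the response; the User-Agent string and Referer value below are only illustrative:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class BrowserLikeRequest {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://content.somesite.com/demo/somepage.asp");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Illustrative browser-like headers; pick a real User-Agent string
        // from a list such as the one linked below.
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)");
        conn.setRequestProperty("Referer", "http://content.somesite.com/demo/");

        BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
        br.close();
    }
}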

You can pick a User-Agent string from this list: Seehowitruns: User-agent strings.

In addition, if the page you are requesting is an internal page, it might also depend on cookies that were generated on a previous page.
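
If that turns out to be the case, one way to carry cookies across requests (assuming Java 6 or later) is to install a default in-memory cookie manager before making any connections:

import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;

// Install an in-memory cookie store once, at startup; subsequent
// URLConnection requests will automatically send back any cookies
// the server set on earlier responses.
CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));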

Daniel Vassallo
  • In this case, however, they probably do not want a bot to access their site. If your program is for more than just private use, you may need to check their terms of service. – Thilo Jan 14 '10 at 07:33