0

I'm trying to read an HTML file from a URL. My code works with most sites, but fails on some of them, such as http://dota2.gamepedia.com/Dota_2_Wiki. I guess I need to set a Java proxy or something?

Here's my code:

    try {
        URL webPage = new URL("http://dota2.gamepedia.com/Dota_2_Wiki");

        URLConnection con = webPage.openConnection();
        con.setConnectTimeout(5000);
        con.setReadTimeout(5000);

        BufferedReader in = new BufferedReader(
                            new InputStreamReader(con.getInputStream()));

        String inputLine;
        while ((inputLine = in.readLine()) != null)
            System.out.println(inputLine);

        in.close();
    }
    catch (MalformedURLException exc){exc.printStackTrace();}
    catch (IOException exc){exc.printStackTrace();}

The result:

    java.io.IOException: Server returned HTTP response code: 403 for URL: http://dota2.gamepedia.com/Dota_2_Wiki
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1838)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1439)
        at com.Popov.Main.main(Main.java:17)

The server returns error code 403. How can I get access to the page? By the way, it loads correctly in a browser.

Vadim Popov

3 Answers

2

Most likely your problem is that the User-Agent header is not set properly. For those who prefer vanilla Java, here is the code:

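    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.CookieHandler;
    import java.net.CookieManager;
    import java.net.CookiePolicy;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Placeholder browser-like value (an assumption); USER_AGENT must be defined somewhere in the class.
    private static final String USER_AGENT = "Mozilla/5.0";

    private void sendGet() throws Exception {

        String url = "http://dota2.gamepedia.com/Dota_2_Wiki";

        URL obj = new URL(url);
        // Keep cookies between requests so the redirects can complete.
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
        HttpURLConnection con = (HttpURLConnection) obj.openConnection();

        con.setRequestMethod("GET");
        // Replace Java's default User-Agent with a browser-like one.
        con.setRequestProperty("User-Agent", USER_AGENT);

        System.out.println("\nSending 'GET' request to URL : " + url);
        int responseCode = con.getResponseCode();
        System.out.println("Response Code : " + responseCode);

        StringBuilder response = new StringBuilder();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                response.append(inputLine);
            }
        }

        System.out.println(response.toString());

    }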

Note that you also need to set up cookie handling: when I tried the code without it, the request failed with a "too many redirects" error.
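To illustrate what the cookie manager does, here is a minimal sketch that can run without any network access. It feeds a `Set-Cookie` header into a `CookieManager` and asks which `Cookie` header it would send back on the next request to the same site. The class name `CookieRoundTrip`, the cookie `session=abc`, and the `example.com` URI are made up for illustration. That the wiki relies on such a cookie to end its redirect chain is my guess, not something I verified.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; not part of the code above.
class CookieRoundTrip {

    // Stores one Set-Cookie header value for the given URI, then returns the
    // request headers the manager would attach to a later request to that URI.
    static Map<String, List<String>> roundTrip(URI uri, String setCookie) throws IOException {
        // Same construction as in sendGet(); passing it to CookieHandler.setDefault(...)
        // is what makes HttpURLConnection use it automatically.
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);

        Map<String, List<String>> responseHeaders =
                Collections.singletonMap("Set-Cookie", Collections.singletonList(setCookie));
        manager.put(uri, responseHeaders);

        return manager.get(uri, new HashMap<String, List<String>>());
    }

    public static void main(String[] args) throws IOException {
        // Made-up cookie and host, for illustration only.
        URI uri = URI.create("http://example.com/page");
        System.out.println(roundTrip(uri, "session=abc; Path=/"));
    }
}
```

Without a default cookie handler, `HttpURLConnection` drops the cookies it receives on each redirect, so a server that waits for one keeps redirecting.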

kucing_terbang
1

You can simply try the jsoup HTML parser. See the sample code:

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public static void main(String[] args) throws IOException {

        // A browser-like user agent is set explicitly; timeout(0) means no timeout.
        Document doc = Jsoup
                .connect("http://dota2.gamepedia.com/Dota_2_Wiki")
                .userAgent(
                        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36")
                .timeout(0).followRedirects(true).execute().parse();
        Elements titles = doc.select(".entrytitle");

        // print all titles on the main page
        for (Element e : titles) {
            System.out.println("text: " + e.text());
            System.out.println("html: " + e.html());
        }

        // print all available links on the page
        Elements links = doc.select("a[href]");
        for (Element l : links) {
            System.out.println("link: " + l.attr("abs:href"));
        }

    }

Sai Ye Yan Naing Aye
0

I think your problem here is that the server doesn't accept your "User-Agent" string and returns a 403 Forbidden code.

One answer suggested using jsoup and setting the user agent manually, but didn't explain that setting the user agent is the crucial step. You could use that approach.

Or you could read "Setting user agent of a java URLConnection" and set the user agent of the URLConnection yourself. This approach doesn't need any external libraries.
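A minimal sketch of that second approach, keeping the timeouts from the question, could look like the following. The class name `UserAgentFetch`, the helper names, the `Mozilla/5.0` value, and the UTF-8 decoding are my assumptions; I don't know which user agent strings this particular server accepts, so you may need a fuller browser string, and another answer reports that cookie handling is needed as well.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class UserAgentFetch {

    // Assumed placeholder; the value the server actually accepts is not known.
    static final String USER_AGENT = "Mozilla/5.0";

    // Creates the connection object and sets the header. Nothing is sent
    // until the input stream is requested.
    static URLConnection open(String address, String userAgent) throws IOException {
        URLConnection con = new URL(address).openConnection();
        con.setConnectTimeout(5000);
        con.setReadTimeout(5000);
        con.setRequestProperty("User-Agent", userAgent);
        return con;
    }

    // Reads the whole response body as text (assumes the page is UTF-8).
    static String fetch(String address) throws IOException {
        URLConnection con = open(address, USER_AGENT);
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // Pass the URL as the first argument, e.g. the wiki page from the question.
        if (args.length > 0) {
            System.out.println(fetch(args[0]));
        }
    }
}
```

The only change relative to the code in the question is the `setRequestProperty("User-Agent", ...)` call, which has to happen before `getInputStream()` opens the connection.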

juhist