I am trying to get the HTML of MyAnimeList.net (specifically this page: http://myanimelist.net/anime.php?q=toradora!
), using a method that has worked for me before on other websites, but it doesn't work here.
The method I use:
public String getWebsiteSourceCode(String sURL) {
    try {
        URL url = new URL(sURL);
        URLConnection urlConn = url.openConnection();
        // NEW LINE
        urlConn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36");
        BufferedReader in = new BufferedReader(new InputStreamReader(
                urlConn.getInputStream(), "UTF-8"));
        String inputLine;
        StringBuilder a = new StringBuilder();
        while ((inputLine = in.readLine()) != null)
            a.append(inputLine);
        in.close();
        return a.toString();
    } catch (Exception e) {
        e.printStackTrace();
        return "null";
    }
}
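In case it helps with diagnosis: one common first step is to send a fuller set of browser-like headers rather than only User-Agent. This is a minimal sketch of that idea (the `buildBrowserHeaders` helper and class name are mine, not part of any library), and Incapsula typically checks more than headers, so this alone may not be enough:

```java
import java.net.URLConnection;
import java.util.LinkedHashMap;
import java.util.Map;

public class BrowserLikeRequest {

    // Hypothetical helper: headers resembling what a real Chrome request sends.
    // Incapsula may still block requests with these, since it can also require
    // cookies and JavaScript execution.
    public static Map<String, String> buildBrowserHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36");
        headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", "en-US,en;q=0.5");
        headers.put("Connection", "keep-alive");
        return headers;
    }

    // Apply every header to an already-opened connection, before reading from it.
    public static void applyHeaders(URLConnection conn) {
        for (Map.Entry<String, String> e : buildBrowserHeaders().entrySet()) {
            conn.setRequestProperty(e.getKey(), e.getValue());
        }
    }
}
```

You would call `applyHeaders(urlConn)` in place of the single `setRequestProperty` line in the method above.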
What I get:
<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head><iframe src="/_Incapsula_Resource?CWUDNSAI=9&incident_id=124000930038292057-125560654487356886&edet=12&cinfo=464f095fc75381e904000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 124000930038292057-125560654487356886</iframe></html>
What I should be getting: the HTML of the page itself (in Google Chrome I can see it via right click → View Page Source, and it is completely different from what my method returns).
From what I get, it mentions ROBOTS, so I assume the website uses cookies or some other check to detect whether the request comes from a browser or a bot. What I want to know is whether it is possible to bypass this, and how I would go about doing so. Thanks for your help :) (Preferably in Java, since that is what I am using.)
EDIT: I tried adding this line:
urlConn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36");
but I get the same Incapsula response...
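For anyone hitting the same wall: changing only the User-Agent is usually not enough here. The page returned is an Incapsula challenge, which generally expects the client to accept cookies (and often to execute the JavaScript in that iframe). One thing worth trying in plain Java is installing a default `CookieManager`, so cookies set by the first response are sent back on subsequent requests. This is a sketch under that assumption and may still not pass the challenge, since the JavaScript step is not executed:

```java
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;

public class CookieSetup {

    // Register a global cookie manager that accepts all cookies. Every
    // URLConnection opened after this call will store cookies from responses
    // and resend them on later requests to the same site.
    public static CookieManager installCookieManager() {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        return manager;
    }
}
```

With this installed before the first call, the `getWebsiteSourceCode` method above carries cookies across calls automatically. If the site still serves the Incapsula page, the realistic options are a headless browser that executes JavaScript (for example HtmlUnit or Selenium), or an official API if one exists; also note that the NOINDEX, NOFOLLOW response is a signal that automated scraping may be against the site's terms.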