I am developing a simple component to scrape hotel reviews from booking.com. I started by using HttpClient to fetch the content of a specific page. Here is an example:
String url = "http://www.booking.com/hotel/sg/"
        + "parkroyal-on-pickering.en.html#tab-reviews";
// You can load this page in a browser to get a better idea
// of what I am trying to do.
HttpClient client = new HttpClient();
GetMethod method = new GetMethod(url);
try {
    int returnCode = client.executeMethod(method);
    // Read the response body line by line as UTF-8.
    BufferedReader br = new BufferedReader(new InputStreamReader(
            method.getResponseBodyAsStream(), "utf-8"));
    String readLine;
    StringBuilder source = new StringBuilder();
    while ((readLine = br.readLine()) != null) {
        source.append(readLine);
        source.append("\n");
    }
    return source;
} finally {
    // Hand the connection back to the connection manager.
    method.releaseConnection();
}
I was able to get the content and so far so good.
However, a problem occurred when I tried to navigate through the pages. The part of the web page containing the reviews is generated dynamically by JavaScript. When the Next Page button is clicked, the next 25 reviews are retrieved.
I looked at the source code of the web page and found the actual URL that loads the reviews, which is something like this:
http://www.booking.com/reviewlist.html?cc1=sg&pagename=parkroyal-on-pickering&offset=25
I tried to open it in the browser and it worked fine: I was able to see the reviews. However, when I requested it with the same code as before, it did not work, and a 400 status code was returned.
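For reference, here is a minimal sketch of how the per-page URLs could be built. It assumes that `offset` advances in steps of 25 (the number of reviews loaded per click) and that `offset=0` corresponds to the first page; neither is confirmed by the site. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of booking.com or HttpClient.
class ReviewUrls {
    // Reviews loaded per "next page" click, as observed on the page.
    static final int PAGE_SIZE = 25;
    static final String BASE = "http://www.booking.com/reviewlist.html";

    // Builds the review-list URL for a zero-based page index.
    // Assumption: offset = pageIndex * 25, so page 0 has offset 0.
    static String reviewPageUrl(String countryCode, String pageName, int pageIndex) {
        if (pageIndex < 0) {
            throw new IllegalArgumentException("pageIndex must be >= 0");
        }
        return BASE + "?cc1=" + encode(countryCode)
                + "&pagename=" + encode(pageName)
                + "&offset=" + (pageIndex * PAGE_SIZE);
    }

    // URL-encodes a single query parameter value.
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // Print the URLs for the first three pages.
        for (int page = 0; page < 3; page++) {
            System.out.println(reviewPageUrl("sg", "parkroyal-on-pickering", page));
        }
    }
}
```

With page index 1, this yields the same URL as the one shown above.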
So basically, for a URL like:
http://www.booking.com/reviewlist.html?cc1=sg&pagename=parkroyal-on-pickering&offset=25
the HttpClient code that worked for the first page failed to retrieve the content, while actual browsers (Chrome and IE) were able to load the page.
I am quite new to HttpClient and web scraping, so any advice or suggestions would be appreciated.
UPDATE: As Rhand suggested, I played around with the request headers, and it turned out that the URL I was trying to call requires the following two headers:
method.setRequestHeader("Accept-Language", "en-US,en;q=0.8,zh-CN;q=0.6");
method.setRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) "
        + "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.76 "
        + "Safari/537.36");
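Putting it together, here is a self-contained sketch of a request that sends both headers and checks the status code before reading the body. To run without extra dependencies it uses the JDK's `HttpURLConnection` instead of the Commons HttpClient used above; the class and method names (`ReviewFetcher`, `browserLikeHeaders`, `fetch`) are hypothetical, and whether the site still accepts these exact header values is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; swaps Commons HttpClient for the JDK's HttpURLConnection.
class ReviewFetcher {

    // The two headers found to be required, with the same values as above.
    static Map<String, String> browserLikeHeaders() {
        Map<String, String> headers = new LinkedHashMap<String, String>();
        headers.put("Accept-Language", "en-US,en;q=0.8,zh-CN;q=0.6");
        headers.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) "
                + "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.76 "
                + "Safari/537.36");
        return headers;
    }

    // Fetches the page as text, failing loudly on any non-200 status.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            for (Map.Entry<String, String> h : browserLikeHeaders().entrySet()) {
                conn.setRequestProperty(h.getKey(), h.getValue());
            }
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status + " for " + url);
            }
            StringBuilder source = new StringBuilder();
            try (BufferedReader br = new BufferedReader(new InputStreamReader(
                    conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = br.readLine()) != null) {
                    source.append(line).append('\n');
                }
            }
            return source.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String url = "http://www.booking.com/reviewlist.html"
                + "?cc1=sg&pagename=parkroyal-on-pickering&offset=25";
        System.out.println(fetch(url));
    }
}
```

Keeping the headers in one place makes it easy to try removing or changing them one at a time to see which ones the server actually insists on.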