I am developing a simple component to scrape hotel reviews from booking.com. I started by using HttpClient to fetch the content of a specific page. Here is one example:

  String url = "http://www.booking.com/hotel/sg/"+
             "parkroyal-on-pickering.en.html#tab-reviews";

  //you can try to load this page in the browser if you want, 
  //so you will have a better idea about what I am trying to do

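  // Commons HttpClient 3.x client; skip this line if you already have one
  HttpClient client = new HttpClient();
  GetMethod method = new GetMethod(url);

  int returnCode = client.executeMethod(method);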

  BufferedReader br = new BufferedReader(new InputStreamReader(
                    method.getResponseBodyAsStream(), "utf-8"));
  String readLine;

  StringBuilder source = new StringBuilder();

  while (((readLine = br.readLine()) != null)) {
    source.append(readLine);
    source.append("\n");
  }

  return source;

I was able to get the content and so far so good.

However, the problem occurred when I tried to navigate through the pages. The part of the web page containing the reviews is generated dynamically by JavaScript. When the NextPage button is clicked, the next 25 reviews are retrieved.

I looked at the source code of the web page and found the actual URL used to load the reviews, which is something like this:

http://www.booking.com/reviewlist.html?cc1=sg&pagename=parkroyal-on-pickering&offset=25
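
To page through the reviews, the `offset` parameter presumably advances in steps of 25. Here is a minimal sketch of a URL builder under that assumption. The class name `ReviewListUrls`, the method `forPage`, and the idea that `offset=0` is the first page are hypothetical, not something the site documents:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class ReviewListUrls {
    // Page size observed in the browser: each "next page" click loads 25 more reviews.
    static final int PAGE_SIZE = 25;

    /** Builds the review-list URL for a zero-based page index (page 1 gives offset=25). */
    static String forPage(String countryCode, String pageName, int pageIndex) {
        if (pageIndex < 0) {
            throw new IllegalArgumentException("pageIndex must be >= 0");
        }
        try {
            return "http://www.booking.com/reviewlist.html"
                    + "?cc1=" + URLEncoder.encode(countryCode, "UTF-8")
                    + "&pagename=" + URLEncoder.encode(pageName, "UTF-8")
                    + "&offset=" + (pageIndex * PAGE_SIZE);
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `ReviewListUrls.forPage("sg", "parkroyal-on-pickering", 1)` yields the URL shown above.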

I opened it in the browser and it worked fine: I was able to see the reviews. However, when I requested it with the same code as before, it did not work and a 400 error code was returned.

So basically, for a URL like:

http://www.booking.com/reviewlist.html?cc1=sg&pagename=parkroyal-on-pickering&offset=25

the HttpClient code that worked for the first page failed to retrieve the content, while actual browsers (Chrome and IE) were able to load the page.

I am quite new to HttpClient and web page scraping, so any advice or suggestions will be appreciated.

UPDATE: As Rhand suggested, I played around with the request headers, and it turned out that, for the URL I was trying to call, the following two headers are required:

 method.setRequestHeader("Accept-Language", "en-US,en;q=0.8,zh-CN;q=0.6");
 // A Java string literal cannot span lines, so the value is concatenated
 method.setRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) "
              + "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.76 "
              + "Safari/537.36");
usernameTaken

3 Answers

Try adding the HTTP headers your browser generates, like "Accept-Encoding:gzip,deflate,sdch". I got it to work by adding them one by one.
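
The trial-and-error can be automated. The sketch below goes the other way round from what I did by hand: it starts from the full set of headers the browser sends and drops them one at a time, keeping only those the request cannot do without. `RequiredHeaders`, `minimize`, and the `probe` callback are hypothetical names; `probe` is assumed to send the request with exactly the given headers and return true on a 200 response.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

final class RequiredHeaders {
    /**
     * Returns the headers that could not be dropped without the probe failing.
     * Assumes the probe succeeds when given the full candidate set.
     */
    static Map<String, String> minimize(Map<String, String> candidates,
                                        Predicate<Map<String, String>> probe) {
        Map<String, String> kept = new LinkedHashMap<>(candidates);
        for (String name : new ArrayList<>(candidates.keySet())) {
            Map<String, String> trial = new LinkedHashMap<>(kept);
            trial.remove(name);
            if (probe.test(trial)) {
                kept = trial; // still succeeds without this header, so drop it
            }
        }
        return kept;
    }
}
```

Each header costs one request, so the full run sends as many requests as there are candidate headers. The result is one workable set rather than the only one, since the server may accept several combinations.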

Oded Peer
  • Thank you for your advice. I googled HTTP headers and it seems there are quite a lot of headers that can be set, and some of them can take many different values. Is there a standard I can follow, or a "default" set of headers I can try? Do you mind telling me which ones you added, if you still remember? Thank you. – usernameTaken Jan 28 '14 at 15:12

There is probably something wrong with the way you do your request. You can check the headers that your browser is sending and mimic those.

For instance, in Google Chrome you can use the developer tools; see "View HTTP headers in Google Chrome?".

BTW: Booking.com does have an API, and you should probably use that instead: https://secure.booking.com/partnerreg.html

Rhand

Having just made some requests to that URL using wget, it seems that the server needs the following headers to be present in the request in order to return a 200 OK:

User-Agent: Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8

Accept-Encoding: gzip,deflate,sdch

Without those, the server seems to return a 400 Bad Request (you may be able to play with the header values somewhat).

So in your code, it should just be a case of calling addRequestHeader() on the GetMethod instance for each header above:

GetMethod method = new GetMethod(url);
method.addRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36");
method.addRequestHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
method.addRequestHeader("Accept-Encoding", "gzip,deflate,sdch");
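
One caveat with `Accept-Encoding`: once you advertise gzip, the server may send a compressed body, and as far as I know Commons HttpClient 3.x does not decompress it for you. Below is a minimal sketch of reading the body either way. `ResponseBodies` and `readBody` are hypothetical names, and the sketch only handles gzip, not deflate or sdch:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

final class ResponseBodies {
    /** Reads the body as UTF-8 text, un-gzipping first if Content-Encoding is gzip. */
    static String readBody(InputStream raw, String contentEncoding) throws IOException {
        boolean gzipped = contentEncoding != null
                && "gzip".equalsIgnoreCase(contentEncoding.trim());
        InputStream in = gzipped ? new GZIPInputStream(raw) : raw;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With the `GetMethod` above you would pass `method.getResponseBodyAsStream()` and the value of `method.getResponseHeader("Content-Encoding")`, which is null when the header is absent. If the request also works without `Accept-Encoding`, leaving that header out avoids the issue entirely.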
Will Keeling
  • Thank you for your advice. I had already tried what Rhand suggested, and it turned out that the following two headers are required: method.setRequestHeader("Accept-Language","en-US,en;q=0.8,zh-CN;q=0.6"); method.setRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.76 Safari/537.36"); – usernameTaken Jan 28 '14 at 16:52