I have the following question.

I am trying to harvest data from the Booking website (for me only, in order to learn the functionality of the rvest package). Everything's good and fine, the package seems to collect what I want and to put everything in the table (dataframe). Here's my code:

library(rvest)
library(lubridate)
library(tidyverse)

page_booking <- c("https://www.booking.com/searchresults.html?aid=397594&label=gog235jc-1FCAEoggI46AdIM1gDaDuIAQGYAQe4ARfIAQzYAQHoAQH4AQyIAgGoAgO4Atap6PoFwAIB0gIkY2RhYmM2NTUtMDRkNS00ODY1LWE3MDYtNzQ1ZmRmNjY3NWY52AIG4AIB&sid=409e05f0cfc7a9e98de21dc3e633dbd6&tmpl=searchresults&ac_click_type=b&ac_position=0&checkin_month=9&checkin_monthday=10&checkin_year=2020&checkout_month=9&checkout_monthday=17&checkout_year=2020&class_interval=1&dest_id=197&dest_type=country&from_sf=1&group_adults=2&group_children=0&label_click=undef&no_rooms=1&offset=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=eb0e56a23d6c0004&ss=Spanien&ss_raw=spanien&ssb=empty&top_ufis=1&selected_currency=USD&changed_currency=1&top_currency=1&nflt=") %>%
  paste0(1:60) %>%
  paste0(c("?ie=UTF8&pageNumber=")) %>%
  paste0(1:60) %>%
  paste0(c("&pageSize=10&sortBy=recent"))

In this chunk I collect the data from the first 60 pages, after first manually feeding the Booking search engine with the country of my choice (Spain), the dates I am interested in (just some arbitrary interval) and the number of people (I used the defaults here).

Then, I add this code to select the properties I want:

read_hotel <- function(url){  # collecting hotel names
  ho <- read_html(url)
  headline <- ho %>%
    html_nodes("span.sr-hotel__name") %>%  # the node I want to read
    html_text() %>%
    as_tibble()
} 

hotels <- map_dfr(page_booking, read_hotel)

read_pr <- function(url){    # collecting price tags
  pr <- read_html(url)
  full_pr <- pr %>%
    html_nodes("div.bui-price-display__value") %>% #the node I want to read
   html_text() %>%
    as_tibble()
}

fullprice <- map_dfr(page_booking, read_pr)

... and eventually save the whole data in the dataframe:

dfr <- tibble(hotels = hotels,
             price_fact =  fullprice)

I collect more parameters, but that doesn't matter here. A final dataframe of 1500 rows and two columns is then created. The problem is that the data in the second column does not correspond to the data in the first one, which is really strange and renders my dataframe useless. I don't really understand how the package works in the background and why it behaves that way. I also noticed that the first rows of the first column (hotel name) do not correspond to the first hotels I see on the website, so rvest seems to use different search/sort/filter criteria. Could you please explain what happens while rvest hops between nodes? I would really appreciate at least some explanation, just to better understand the tool I am working with.

kshtwork

1 Answer

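You shouldn't scrape the hotels' names and prices separately like that. html_nodes() returns only the elements that actually exist on the page, so if any hotel lacks a price element, the price vector comes out shorter and no longer lines up with the names. What you should do is get the nodes of all items (hotels) first, then scrape the name and price relative to each hotel. With this method, you can't mess up the order.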

library(rvest)
library(purrr)
page_booking <- c("https://www.booking.com/searchresults.html?aid=397594&label=gog235jc-1FCAEoggI46AdIM1gDaDuIAQGYAQe4ARfIAQzYAQHoAQH4AQyIAgGoAgO4Atap6PoFwAIB0gIkY2RhYmM2NTUtMDRkNS00ODY1LWE3MDYtNzQ1ZmRmNjY3NWY52AIG4AIB&sid=409e05f0cfc7a9e98de21dc3e633dbd6&tmpl=searchresults&ac_click_type=b&ac_position=0&checkin_month=9&checkin_monthday=10&checkin_year=2020&checkout_month=9&checkout_monthday=17&checkout_year=2020&class_interval=1&dest_id=197&dest_type=country&from_sf=1&group_adults=2&group_children=0&label_click=undef&no_rooms=1&offset=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=eb0e56a23d6c0004&ss=Spanien&ss_raw=spanien&ssb=empty&top_ufis=1&selected_currency=USD&changed_currency=1&top_currency=1&nflt=") %>%
  paste0(1:60) %>%
  paste0(c("?ie=UTF8&pageNumber=")) %>%
  paste0(1:60) %>%
  paste0(c("&pageSize=10&sortBy=recent"))


hotels <- 
  map_dfr(
    page_booking,
    function(url) {
      pg <- read_html(url)
      items <- pg %>%
        html_nodes(".sr_item")
      map_dfr(
        items,
        function(item) {
          data.frame(
            hotel = item %>% html_node(xpath = "./descendant::*[contains(@class,'sr-hotel__name')]") %>% html_text(trim = TRUE),
            price = item %>% html_node(xpath = "./descendant::*[contains(@class,'bui-price-display__value')]") %>% html_text(trim = TRUE)
          )
        }
      )
    }
  )

(The dot at the start of each XPath expression stands for the current node, which is the hotel item.)
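To illustrate why the leading dot matters, here is a minimal sketch on a made-up two-hotel page (the markup is an assumption that only mimics the class names used above, not Booking's real HTML). Without the dot, `//` searches from the document root, so every item would return the first hotel on the page.

```r
library(rvest)

# Hypothetical toy markup that reuses the class names from the answer
doc <- read_html('<html><body>
  <div class="sr_item"><span class="sr-hotel__name">Hotel A</span></div>
  <div class="sr_item"><span class="sr-hotel__name">Hotel B</span></div>
</body></html>')

items  <- html_nodes(doc, ".sr_item")
second <- items[[2]]  # the second hotel item

# Leading dot: the search starts at the current node (the second item)
relative_name <- second %>%
  html_node(xpath = "./descendant::*[contains(@class,'sr-hotel__name')]") %>%
  html_text(trim = TRUE)

# No dot: the search starts at the document root, ignoring the current node
absolute_name <- second %>%
  html_node(xpath = "//*[contains(@class,'sr-hotel__name')]") %>%
  html_text(trim = TRUE)

relative_name  # "Hotel B"
absolute_name  # "Hotel A"
```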

Update: here is a version that I think is faster but still does the job:

hotels <-
  map_dfr(
    page_booking,
    function(url) {
      pg <- read_html(url)
      items <- pg %>%
        html_nodes(".sr_item")
      data.frame(
        hotel = items %>% html_node(xpath = "./descendant::*[contains(@class,'sr-hotel__name')]") %>% html_text(trim = TRUE),
        price = items %>% html_node(xpath = "./descendant::*[contains(@class,'bui-price-display__value')]") %>% html_text(trim = TRUE)
      )
    }
  )
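The difference between the two approaches can be seen on a small made-up page in which one hotel has no price. This is a sketch under that assumption; the markup is hypothetical and only borrows the class names from the code above. Page-wide html_nodes() silently skips the missing price, while html_node() applied to the item nodes returns exactly one value per item and fills the gap with NA.

```r
library(rvest)

# Hypothetical toy page: the second hotel has no price element
page <- read_html('<html><body>
  <div class="sr_item">
    <span class="sr-hotel__name">Hotel A</span>
    <div class="bui-price-display__value">US$100</div>
  </div>
  <div class="sr_item">
    <span class="sr-hotel__name">Hotel B</span>
  </div>
  <div class="sr_item">
    <span class="sr-hotel__name">Hotel C</span>
    <div class="bui-price-display__value">US$300</div>
  </div>
</body></html>')

# Page-wide selection: only elements that exist are returned,
# so the two vectors end up with different lengths
names_all  <- page %>% html_nodes("span.sr-hotel__name") %>% html_text(trim = TRUE)
prices_all <- page %>% html_nodes("div.bui-price-display__value") %>% html_text(trim = TRUE)
length(names_all)   # 3
length(prices_all)  # 2

# Per-item selection: one result per item, NA where nothing matches
items <- page %>% html_nodes(".sr_item")
per_item <- data.frame(
  hotel = items %>% html_node(".sr-hotel__name") %>% html_text(trim = TRUE),
  price = items %>% html_node(".bui-price-display__value") %>% html_text(trim = TRUE)
)
per_item
```

In `per_item`, Hotel B keeps its row with an NA price, so Hotel C's price stays attached to Hotel C.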
xwhitelight
  • Thanks for your quick reply. Nice idea. It didn't fully work for me, though, and I have no idea why. The price vector reads only NAs from the node (although that makes no sense, since I was able to read the data from the node with my (incorrect) code). Here's an excerpt from the output: > hotels hotel price 1 Riu Plaza España 2 H10 Andalucía Plaza - Adults only – kshtwork Sep 11 '20 at 16:12
  • @kshtwork The URLs have a problem now: they don't show prices anymore. You can try changing the check-in and check-out dates so that prices are shown, then update the code. – xwhitelight Sep 11 '20 at 16:17
  • @kshtwork You can check by printing `page_booking[1]` in the console and pasting it into a web browser. – xwhitelight Sep 11 '20 at 16:19