I have the following question.
I am trying to harvest data from the Booking website (for me only, in order to learn the functionality of the rvest package). Everything's good and fine, the package seems to collect what I want and to put everything in the table (dataframe). Here's my code:
library(rvest)
library(lubridate)
library(tidyverse)
page_booking <- c("https://www.booking.com/searchresults.html?aid=397594&label=gog235jc-1FCAEoggI46AdIM1gDaDuIAQGYAQe4ARfIAQzYAQHoAQH4AQyIAgGoAgO4Atap6PoFwAIB0gIkY2RhYmM2NTUtMDRkNS00ODY1LWE3MDYtNzQ1ZmRmNjY3NWY52AIG4AIB&sid=409e05f0cfc7a9e98de21dc3e633dbd6&tmpl=searchresults&ac_click_type=b&ac_position=0&checkin_month=9&checkin_monthday=10&checkin_year=2020&checkout_month=9&checkout_monthday=17&checkout_year=2020&class_interval=1&dest_id=197&dest_type=country&from_sf=1&group_adults=2&group_children=0&label_click=undef&no_rooms=1&offset=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=eb0e56a23d6c0004&ss=Spanien&ss_raw=spanien&ssb=empty&top_ufis=1&selected_currency=USD&changed_currency=1&top_currency=1&nflt=") %>%
paste0(1:60) %>%
paste0(c("?ie=UTF8&pageNumber=")) %>%
paste0(1:60) %>%
paste0(c("&pageSize=10&sortBy=recent"))
so in this chunk I collect the data from the first 60 pages after first manually feeding the Booking search engine with the country of my choise (Spain), the dates I am interested in (just some arbitrary interval) and the number of people (I used defaults here).
Then, I add this code to select the properties I want:
read_hotel <- function(url){ # collecting hotel names
ho <- read_html(url)
headline <- ho %>%
html_nodes("span.sr-hotel__name") %>% # the node I want to read
html_text() %>%
as_tibble()
}
hotels <- map_dfr(page_booking, read_hotel)
read_pr <- function(url){ # collecting price tags
pr <- read_html(url)
full_pr <- pr %>%
html_nodes("div.bui-price-display__value") %>% #the node I want to read
html_text() %>%
as_tibble()
}
fullprice <- map_dfr(page_booking, read_pr)
... and eventually save the whole data in the dataframe:
dfr <- tibble(hotels = hotels,
price_fact = fullprice)
I collect more parameters but this doesn't matter. The final dataframe of 1500 rows and two columns is then created. But the problem is the data within the second column does not correspond to the data in the first one. Which is really strange and renders my dataframe to be useless. I don't really understand how the package works in the background and why does it behaves that way. I also paid attention the first rows in the first column of the dataframe (hotel name) do not correspond to the first hotels I see on the website. So it seems to be a different search/sort/filter criteria the rvest package uses. Could you please explain me the processes take place during the rvest node hoping? I would really appreciate at least some explanation, just to better understand the tool we work with.