I am scraping hotel reviews from the following TripAdvisor page:

library(rvest)
web <- read_html("https://www.tripadvisor.es/Hotel_Review-g187507-d228530-Reviews-Melia_Maria_Pita-La_Coruna_Province_of_A_Coruna_Galicia.html")

I want to get the review dates so that I can compute the number of reviews per week, but I am not able to express the date in the appropriate format. I tried the following, but it returns only NA:

dateComment<-web%>%
  html_nodes(".location-review-review-list-parts-EventDate__event_date--1epHa")%>%
  html_attr("title")
#[1] NA NA NA NA NA

I have also tried html_text, which gives me the date written out in words. However, when I try to convert it to a Date, I get the error: do not know how to convert 'df$fechaComentarios' to class "Date"

dateComment<-web%>%
  html_nodes(".location-review-review-list-parts-EventDate__event_date--1epHa")%>%html_text() 

df$dateComment=gsub("de","",df$dateComment)
df$date <- as.Date(df$fechaComentarios, format = "%d %B %Y")

Thank you in advance!

Ian Campbell
Maria

2 Answers

As @MacOS points out, you need to extract the text. The text string also carries the prefix "Fecha de la estancia: ", but you can easily get rid of that using str_extract.

The main issue is that you're trying to parse a date without a day into a date class, which can't work: the strings contain only a month and a year, while the format "%d %B %Y" expects a day as well.

One approach would be to paste on an arbitrary day, say 01, and then use lubridate::parse_date_time to parse.

library(rvest)
library(stringr)
library(lubridate)
web%>%
  html_nodes(
    ".location-review-review-list-parts-EventDate__event_date--1epHa") %>%
  html_text %>%
  str_extract("(?<=: ).+") %>%
  paste0("01 ",.) %>%
  parse_date_time("%d %B %Y",locale = "es_ES")
#[1] "2020-03-01 UTC" "2019-10-01 UTC" "2020-02-01 UTC" "2020-02-01 UTC" "2020-02-01 UTC"

If your system locale is already Spanish, you can probably skip the locale = "es_ES" argument. I'm in the US with an English locale, so I had to set it for the Spanish month names to be recognized.
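If the "es_ES" locale is not installed on your machine, a locale-independent alternative is to look up the Spanish month name yourself. This is only a minimal base-R sketch: the helper name parse_stay_date and the vector meses are made up for illustration, and the sample strings are copied from the html_text() output shown in this thread rather than scraped live.

```r
# Sample strings in the form returned by html_text() for these nodes.
raw <- c("Fecha de la estancia: marzo de 2020",
         "Fecha de la estancia: octubre de 2019",
         "Fecha de la estancia: febrero de 2020")

# Spanish month names in calendar order; the position is the month number.
meses <- c("enero", "febrero", "marzo", "abril", "mayo", "junio",
           "julio", "agosto", "septiembre", "octubre", "noviembre", "diciembre")

# Hypothetical helper: turns "... : <mes> de <año>" into a Date on day 1.
parse_stay_date <- function(x) {
  # Drop everything up to and including the last ": ".
  x <- sub("^.*: ", "", x)
  # Split "marzo de 2020" into the month name and the year.
  parts <- strsplit(x, " de ", fixed = TRUE)
  month <- match(tolower(vapply(parts, `[`, character(1), 1)), meses)
  year  <- as.integer(vapply(parts, `[`, character(1), 2))
  # Day 1 is arbitrary; the numeric format does not depend on the locale,
  # and unrecognized months come back as NA instead of raising an error.
  as.Date(sprintf("%04d-%02d-01", year, month), format = "%Y-%m-%d")
}

dates <- parse_stay_date(raw)
print(dates)
```

Since the page only gives month and year, the result can be grouped by month (for example with table(format(dates, "%Y-%m"))), but not by week.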

Ian Campbell

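Your code does not work because you did not properly extract the date. html_attr("title") returns NA because these nodes have no title attribute; the date is in the node text, so use html_text().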

library(rvest)

web <- read_html("https://www.tripadvisor.es/Hotel_Review-g187507-d228530-Reviews-Melia_Maria_Pita-La_Coruna_Province_of_A_Coruna_Galicia.html")

dateComment<-web%>%
  html_nodes(".location-review-review-list-parts-EventDate__event_date--1epHa")%>%html_text()

head(dateComment)

This returns

c("Fecha de la estancia: marzo de 2020", "Fecha de la estancia: octubre de 2019", 
"Fecha de la estancia: febrero de 2020", "Fecha de la estancia: febrero de 2020", 
"Fecha de la estancia: febrero de 2020")

The following extracts the date correctly.

dateComment <- strsplit(dateComment, ": ")

dateComment <- unlist(lapply(dateComment, FUN = function(x) {x[2]}))

head(dateComment)
#[1] "marzo de 2020"   "octubre de 2019" "febrero de 2020" "febrero de 2020" "febrero de 2020"

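For as.Date to work correctly, the time locale (LC_TIME) has to match the language of the month names, Spanish in this case, because %B is matched against the month names of the current locale. Note that as.Date also needs a day of the month, so one has to be added to strings like these. See here for details.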
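A rough sketch of what switching the time locale could look like. The locale name "es_ES.UTF-8" is an assumption: locale names differ between platforms (Windows uses different names), and the locale may not be installed, in which case Sys.setlocale returns an empty string and the parsing step is skipped.

```r
x <- c("marzo de 2020", "octubre de 2019", "febrero de 2020")

# Remember the current time locale so it can be restored afterwards.
old <- Sys.getlocale("LC_TIME")

# Assumed locale name; returns "" (with a warning) if it is unavailable.
ok <- suppressWarnings(Sys.setlocale("LC_TIME", "es_ES.UTF-8"))

parsed <- NULL
if (nzchar(ok)) {
  # as.Date needs a day of the month, so prepend an arbitrary "01".
  # The literal "de" in the format matches the "de" in the input.
  parsed <- as.Date(paste("01", x), format = "%d %B de %Y")
  print(parsed)
}

# Restore the original locale.
Sys.setlocale("LC_TIME", old)
```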

MacOS